DSPRelated.com
Forums

Highest possible speed algo in Vb or C or Pseudocode

Started by gebe January 1, 2008
I am trying to implement a fast 256- to 2048-point (powers of two) FFT
algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
but for SPEED in specialised media.
At the moment (using the best I could find) I am getting 36 usec for
1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
I have lifted some material (articles, PDFs), but the articles' source
code is incomplete or unavailable and the PDFs give only the maths, and
I cannot (too stupid) turn them into working algorithms...

gebe


gebe wrote:
> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
> but for SPEED in specialised media.
> At the moment (using the best I could find) I am getting 36 usec for
> 1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
> I have lifted some material (articles, PDFs), but the articles' source
> code is incomplete or unavailable and the PDFs give only the maths, and
> I cannot (too stupid) turn them into working algorithms...
How does 78 uSec compare to FFTW's time? FFTW is optimized for speed.

Jerry
--
Engineering is the art of making what you want from things you can get.
> gebe wrote:
>> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
>> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
>> but for SPEED in specialised media.
>> At the moment (using the best I could find) I am getting 36 usec for
>> 1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
>> I have lifted some material (articles, PDFs), but the articles' source
>> code is incomplete or unavailable and the PDFs give only the maths, and
>> I cannot (too stupid) turn them into working algorithms...
>
> How does 78 uSec compare to FFTW's time? FFTW is optimized for speed.
>
> Jerry
> --
> Engineering is the art of making what you want from things you can get.
Answer:
I know FFTW has a way to rate speed as MFLOPS, but I don't understand
that measure; to me MFLOPS means mega floating-point operations per
second, and NORMALLY the efficiency of an FFT is INVERSELY PROPORTIONAL
to the number of operations (apart from the difference between
multiplications and adds/subtracts).
My time is measured with the standard timing tools (averaged over 100
runs, with the thread priority raised), i.e. RDTSC and
QueryPerformanceCounter/QueryPerformanceFrequency.

gebe
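For reference, a minimal sketch of the kind of timing harness described
above (Win32 and MSVC assumed; fft_1024() is a hypothetical stand-in for
the DLL routine being measured, not anything from the thread):

/***************************************************/
// Raise the thread priority, then average the RDTSC cycle count over
// 100 calls, as described above. Win32 + MSVC assumed; fft_1024() is a
// hypothetical placeholder for the FFT routine under test.
#include <windows.h>
#include <intrin.h>   // __rdtsc()

extern void fft_1024(float* re, float* im);   // placeholder DLL entry point

double average_cycles(float* re, float* im)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    unsigned __int64 total = 0;
    const int runs = 100;
    for (int i = 0; i < runs; ++i) {
        unsigned __int64 t0 = __rdtsc();
        fft_1024(re, im);
        unsigned __int64 t1 = __rdtsc();
        total += t1 - t0;
    }

    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL);
    return double(total) / runs;   // divide by the CPU clock (Hz) for seconds
}
/**********************************************/
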
gebe wrote:

   ...

> Answer:
> I know FFTW has a way to rate speed as MFLOPS, but I don't understand
> that measure; to me MFLOPS means mega floating-point operations per
> second, and NORMALLY the efficiency of an FFT is INVERSELY PROPORTIONAL
> to the number of operations (apart from the difference between
> multiplications and adds/subtracts).
> My time is measured with the standard timing tools (averaged over 100
> runs, with the thread priority raised), i.e. RDTSC and
> QueryPerformanceCounter/QueryPerformanceFrequency.
I suggest that you run FFTW and measure the time it takes. Time only the
FFT itself, not the code needed to build the framework.

Jerry
--
Engineering is the art of making what you want from things you can get.
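A minimal sketch of that kind of measurement, using the standard FFTW3 C
API: the plan (the "framework") is built once outside the timed region,
and only fftw_execute() is timed with QueryPerformanceCounter, as in the
original poster's setup. N and the run count are arbitrary choices.

/***************************************************/
#include <windows.h>
#include <fftw3.h>
#include <cstdio>

int main()
{
    const int N = 1024, runs = 100;

    fftw_complex* in  = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex* out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N);
    for (int i = 0; i < N; ++i) { in[i][0] = i % 7; in[i][1] = 0.0; }

    // Planning is done once, outside the timing loop.
    fftw_plan plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);

    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    for (int i = 0; i < runs; ++i)
        fftw_execute(plan);              // only the transform itself is timed
    QueryPerformanceCounter(&t1);

    double usec = 1e6 * double(t1.QuadPart - t0.QuadPart)
                      / double(f.QuadPart) / runs;
    printf("%d-point transform: %.2f usec\n", N, usec);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
/**********************************************/
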
On Jan 2, 1:25 am, "gebe" <g...@telkomsa.net> wrote:
> I know FFTW has a way to rate speed as MFLOPS, but I don't understand
> that measure
As clearly defined at the top of our benchmarks page, the "mflops"
number we report is simply a convenient rescaling of the speed
( = 5 N log2 N / [time in microseconds]):

http://www.fftw.org/speed/

(You can also download data files with the raw timing numbers if you
prefer, as described on our web page.)

Regards,
Steven G. Johnson
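In other words, a measured time converts to that "mflops" figure like
this (a trivial sketch of the rescaling just described; the 2048-point,
78 usec figure quoted earlier in the thread is used as the example
input):

/***************************************************/
// "mflops" = 5 N log2(N) / (time in microseconds), as defined above.
#include <cmath>
#include <cstdio>

int main()
{
    double N = 2048.0, time_usec = 78.0;
    double mflops = 5.0 * N * std::log2(N) / time_usec;
    printf("%.0f-point FFT in %.1f usec -> %.1f 'mflops'\n",
           N, time_usec, mflops);
    return 0;
}
/**********************************************/
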
On Jan 1, 10:06 pm, "gebe" <g...@telkomsa.net> wrote:
> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
> but for SPEED in specialised media.
> At the moment (using the best I could find) I am getting 36 usec for
> 1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
That's not bad at all for a first try, but you still have a fairly long
way to go.

Let me assume you are using single precision, which should be more than
sufficient for most applications involving such small transforms. In
that case, according to the benchmark numbers on our web site
(http://www.fftw.org/speed/), FFTW can do a 1024-point complex-data
transform in about 10 usec on a 2.4 GHz AMD Opteron 275 (32-bit mode),
and a 2048-point transform in about 23 usec; naively scaling by the
clock speed to 1.8 GHz gives 13 and 31 usec, respectively. On a newer
machine, a 3 GHz Core Duo in 64-bit mode, the times are a bit under
4 usec and 10 usec, respectively. And if your data are purely real, then
you can gain an additional factor of 1.5 to 2.

(It's fairly hard to get anywhere close to optimal performance on a
general-purpose CPU these days if you start with a textbook radix-2 FFT
algorithm and try to optimize it, if that is what you are doing.)

Regards,
Steven G. Johnson
On 3 Jan, 23:46, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Jan 1, 10:06 pm, "gebe" <g...@telkomsa.net> wrote:
>
>> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
>> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
>> but for SPEED in specialised media.
>> At the moment (using the best I could find) I am getting 36 usec for
>> 1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
>
> That's not bad at all for a first try, but you still have a fairly
> long way to go.
Has anybody tested what can be gained by switching to C++-style[*]
template programming (as opposed to regular code)? I gained a factor of
2 on a naive Chebyshev polynomial computation by switching from regular
code to template programming, using the same naive algorithm in both
cases.

Rune

[*] As far as I know, template programming is only available in C++,
but I wouldn't be surprised if it has become available elsewhere.
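For illustration, a minimal sketch of the kind of template approach Rune
describes (not his actual code): the Chebyshev degree is fixed at
compile time, so the compiler can inline and partially evaluate the
whole recurrence, in contrast to the ordinary run-time version.

/***************************************************/
// T_n(x) = 2x T_{n-1}(x) - T_{n-2}(x), with n as a template parameter.
// Illustrative only; not the code that produced the factor-of-2 figure.
#include <cstdio>

template <int N>
struct Chebyshev {
    static double eval(double x) {
        return 2.0 * x * Chebyshev<N - 1>::eval(x) - Chebyshev<N - 2>::eval(x);
    }
};
template <> struct Chebyshev<1> { static double eval(double x) { return x; } };
template <> struct Chebyshev<0> { static double eval(double)   { return 1.0; } };

// Equivalent "regular" run-time version, for comparison.
double chebyshev_rt(int n, double x)
{
    if (n == 0) return 1.0;
    if (n == 1) return x;
    return 2.0 * x * chebyshev_rt(n - 1, x) - chebyshev_rt(n - 2, x);
}

int main()
{
    printf("T_8(0.3) = %f (templates) = %f (regular)\n",
           Chebyshev<8>::eval(0.3), chebyshev_rt(8, 0.3));
    return 0;
}
/**********************************************/
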

Rune Allnor wrote:

>>> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
>>> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
>>> but for SPEED in specialised media.
> Has anybody tested what can be gained by switching to C++-style[*]
> template programming (as opposed to regular code)?
Here is the real-input FFT code using <vector> and <complex>. It is
somewhat faster than the direct implementation.

Disclaimer: the code was submitted by a third party for my
consideration. I don't know who the author is.

/***************************************************/
#ifndef _FFT_H

#include <vector>
#include <math.h>
#include "complex.h"   // third-party Complex class with .r/.i members and,
                       // presumably, a two-argument exp(magnitude, phase)

using namespace std;

// FFT of a real-valued input vector.
// Returns 0 on success, 1 if the length is not a power of two.
int fft_bah(vector<double>* ps, vector<Complex>* pS)
{
    int N = ps->size();

    // Check that N is a power of two.
    int Z = N;
    while (Z > 1) {
        int Z1 = Z >> 1;
        float Z2 = (float)(Z) / 2;
        if (float(Z1) == Z2) { Z = Z1; }
        else { return 1; }
    }

    pS->clear();
    Complex c;
    vector<Complex> St;
    St.clear();

    // Number of stages = log2(N).
    int step = 0, N1;
    for (N1 = N; N1 > 1;) { step = step + 1; N1 = N1 >> 1; }

    // Copy the real input into a complex working buffer.
    int k = 1;
    for (int i = 0; i < (int)ps->size(); i++) {
        c.r = ps->at(i);
        c.i = 0.0;
        St.push_back(c);
    }

    // Butterfly stages.
    for (int z = 0; z < step; z++) {
        k = k * 2;
        for (int j = 0; j < k; j++) {
            if (j % 2 == 0)
                for (int n = 0; n < N / 2; n++) {
                    c = St.at(n + N * j / 2) + St.at(N / 2 + n + N * j / 2);
                    pS->push_back(c);
                }
            else
                for (int n = 0; n < N / 2; n++) {
                    c = (St.at(N * (j - 1) / 2 + n) - St.at(N / 2 + n + N * (j - 1) / 2))
                        * exp(1.0, -2 * M_PI * n / N);
                    pS->push_back(c);
                }
        }
        N = N / 2;
        St.clear();
        for (int i = 0; i < (int)pS->size(); i++) { c = pS->at(i); St.push_back(c); }
        pS->clear();
    }

    // Bit-reversal reordering of the result.
    int z;
    N = St.size();
    for (int i = 0; i < N; i++) {
        int j = 0;
        int n1 = 0;
        int m = 1;
        j = i;
        for (int k1 = N; k1 > 1;) {
            m = m * 2;
            z = int(j % (k1 / 2));
            n1 = n1 + (j - z) * m / k1;
            k1 = k1 / 2;
            j = z;
        }
        c = St.at(n1);
        pS->push_back(c);
    }
    return 0;
}

#define _FFT_H
#endif
/**********************************************/

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
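A usage sketch for the routine above, under the assumption that the
poster's third-party "complex.h" exposes a Complex type with .r and .i
members as used in the listing (the test signal is arbitrary):

/***************************************************/
#include <stdio.h>
#include <vector>
#include "complex.h"   // same assumed third-party header as above

extern int fft_bah(std::vector<double>* ps, std::vector<Complex>* pS);

int main()
{
    const int N = 8;
    std::vector<double> x(N);
    for (int n = 0; n < N; ++n)
        x[n] = (n < N / 2) ? 1.0 : 0.0;      // simple rectangular pulse

    std::vector<Complex> X;
    if (fft_bah(&x, &X) != 0) {
        printf("length is not a power of two\n");
        return 1;
    }
    for (int k = 0; k < N; ++k)
        printf("X[%d] = %g + %gi\n", k, X[k].r, X[k].i);
    return 0;
}
/**********************************************/
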
On Jan 4, 10:58 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> Has anybody tested what can be gained by switching to C++-style[*]
> template programming (as opposed to regular code)?
What do you suggest using the templates for in the case of an FFT?

For type polymorphism, it's a coding convenience rather than a
performance issue. The complex type is also more a question of
convenience.

For STL container templates, the vendor-optimized parts aren't too
useful here. The fastest FFT algorithms don't really consist of a
sequence of BLAS-like operations or similar collective functions where
the optimized code in STL vector templates will help you
(performance-wise), as far as I can tell; for random access and
operations on individual elements, it is going to be equivalent to
ordinary arrays/pointers.

For Blitz++ style metaprogramming, they have tried it for FFT
(http://www.oonumerics.org/blitz/examples/fft.html). Basically, you can
use templates to write a code generator for hard-coded FFTs of small
sizes. (Generating a hard-coded FFT for sizes > 128 or so is impractical
due to instruction-cache issues.) However, there's much more to
generating fast hard-coded FFTs than just partial evaluation of a
radix-2 FFT, which is what Blitz++ did IIRC; one has to worry about
code-scheduling issues to minimize register spills (the compiler can't
do it for you because it is NP-hard in general, but there are
specialized solutions that work well for FFTs), one would like to
simplify operations involving trivial powers of roots of unity (which
give 1, i, [1+i]/sqrt(2), etcetera, that lead to simplified complex
multiplications), etcetera... see our papers on FFTW's code generator.

Of course, in principle this could all be done with template
metaprogramming, since it is Turing-complete (if I recall correctly),
but it doesn't seem like the right language for this sort of thing. In
FFTW, we wrote our code generator in Caml, which is more suited for
writing compiler-like programs (built-in support for pattern matching,
symbolic-expression datatypes, etc). Anyway, all of this is insufficient
for FFTs > 128 or so, as I mentioned; at best, it gives you building
blocks for larger FFTs, as in FFTW.

Regards,
Steven G. Johnson
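To make the "partial evaluation of a radix-2 FFT" idea concrete, here is
a minimal compile-time sketch in the Blitz++ spirit (not FFTW's
generator, and without any of the scheduling or twiddle-factor
simplifications mentioned above): the transform size is a template
parameter, so the compiler can unroll the whole recursion for a small,
fixed, power-of-two N.

/***************************************************/
#include <complex>
#include <cstddef>
#include <cstdio>

template <std::size_t N>
struct StaticFFT {
    // Out-of-place decimation-in-time step: `in` is read with stride s,
    // and the N-point result is written to out[0..N).
    static void apply(const std::complex<double>* in,
                      std::complex<double>* out, std::size_t s)
    {
        StaticFFT<N / 2>::apply(in,     out,         2 * s);  // even samples
        StaticFFT<N / 2>::apply(in + s, out + N / 2, 2 * s);  // odd samples
        const double pi = 3.14159265358979323846;
        for (std::size_t k = 0; k < N / 2; ++k) {
            std::complex<double> w = std::polar(1.0, -2.0 * pi * double(k) / double(N));
            std::complex<double> e = out[k];
            std::complex<double> o = w * out[k + N / 2];
            out[k]         = e + o;      // butterfly
            out[k + N / 2] = e - o;
        }
    }
};

template <>
struct StaticFFT<1> {                    // 1-point DFT is the identity
    static void apply(const std::complex<double>* in,
                      std::complex<double>* out, std::size_t)
    {
        out[0] = in[0];
    }
};

int main()
{
    const std::size_t N = 8;
    std::complex<double> x[N], X[N];
    for (std::size_t n = 0; n < N; ++n)
        x[n] = (n < N / 2) ? 1.0 : 0.0;  // arbitrary test signal

    StaticFFT<N>::apply(x, X, 1);
    for (std::size_t k = 0; k < N; ++k)
        std::printf("X[%zu] = %+.3f %+.3fi\n", k, X[k].real(), X[k].imag());
    return 0;
}
/**********************************************/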