DSPRelated.com
Forums

Highest possible speed algo in Vb or C or Pseudocode

Started by gebe January 1, 2008
I am trying to implement a fast 256- to 2048-point (powers of two) FFT
algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
but for SPEED in specialised media.
At the moment (using the best I could find) I am getting 36 usec for
1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
I have lifted some material (articles, PDFs), but the articles' source
code is incomplete or unavailable and the PDFs give only the maths, and
I cannot (too stupid) turn them into working algorithms...

gebe


gebe wrote:
> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
> but for SPEED in specialised media.
> At the moment (using the best I could find) I am getting 36 usec for
> 1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
> I have lifted some material (articles, PDFs), but the articles' source
> code is incomplete or unavailable and the PDFs give only the maths, and
> I cannot (too stupid) turn them into working algorithms...
How does 78 uSec compare to FFTW's time? FFTW is optimized for speed.

Jerry
--
Engineering is the art of making what you want from things you can get.
> gebe wrote:
>> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
>> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
>> but for SPEED in specialised media.
>> At the moment (using the best I could find) I am getting 36 usec for
>> 1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
>> I have lifted some material (articles, PDFs), but the articles' source
>> code is incomplete or unavailable and the PDFs give only the maths, and
>> I cannot (too stupid) turn them into working algorithms...
>
> How does 78 uSec compare to FFTW's time? FFTW is optimized for speed.
>
> Jerry
> --
> Engineering is the art of making what you want from things you can get.
Answer:
I know FFTW has a way to rate speed as MFLOPS, but I don't understand
that measure; to me MFLOPS means mega floating-point operations per
second, and NORMALLY the efficiency of an FFT is INVERSELY PROPORTIONAL
to the number of operations (apart from the difference between
multiplications and adds/subtracts).
My time is measured with the standard timing tools (averaged over 100
runs, with the thread priority raised), i.e. RDTSC and
QueryPerformanceCounter/QueryPerformanceFrequency.

gebe
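For reference, a minimal sketch of the kind of timing harness described
above (Win32 and MSVC assumed; fft_1024() is a hypothetical stand-in for
the DLL routine being measured, not anything from the thread):

/***************************************************/
// Raise the thread priority, then average the RDTSC cycle count over
// 100 calls, as described above. Win32 + MSVC assumed; fft_1024() is a
// hypothetical placeholder for the FFT routine under test.
#include <windows.h>
#include <intrin.h>   // __rdtsc()

extern void fft_1024(float* re, float* im);   // placeholder DLL entry point

double average_cycles(float* re, float* im)
{
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL);

    unsigned __int64 total = 0;
    const int runs = 100;
    for (int i = 0; i < runs; ++i) {
        unsigned __int64 t0 = __rdtsc();
        fft_1024(re, im);
        unsigned __int64 t1 = __rdtsc();
        total += t1 - t0;
    }

    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL);
    return double(total) / runs;   // divide by the CPU clock (Hz) for seconds
}
/**********************************************/
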
gebe wrote:

   ...

> Answer:
> I know FFTW has a way to rate speed as MFLOPS, but I don't understand
> that measure; to me MFLOPS means mega floating-point operations per
> second, and NORMALLY the efficiency of an FFT is INVERSELY PROPORTIONAL
> to the number of operations (apart from the difference between
> multiplications and adds/subtracts).
> My time is measured with the standard timing tools (averaged over 100
> runs, with the thread priority raised), i.e. RDTSC and
> QueryPerformanceCounter/QueryPerformanceFrequency.
I suggest that you run FFTW and measure the time it takes. Time only the
FFT itself, not the code needed to build the framework.

Jerry
--
Engineering is the art of making what you want from things you can get.
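A minimal sketch of that kind of measurement, using the standard FFTW3 C
API: the plan (the "framework") is built once outside the timed region,
and only fftw_execute() is timed with QueryPerformanceCounter, as in the
original poster's setup. N and the run count are arbitrary choices.

/***************************************************/
#include <windows.h>
#include <fftw3.h>
#include <cstdio>

int main()
{
    const int N = 1024, runs = 100;

    fftw_complex* in  = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex* out = (fftw_complex*)fftw_malloc(sizeof(fftw_complex) * N);
    for (int i = 0; i < N; ++i) { in[i][0] = i % 7; in[i][1] = 0.0; }

    // Planning is done once, outside the timing loop.
    fftw_plan plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_MEASURE);

    LARGE_INTEGER f, t0, t1;
    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);
    for (int i = 0; i < runs; ++i)
        fftw_execute(plan);              // only the transform itself is timed
    QueryPerformanceCounter(&t1);

    double usec = 1e6 * double(t1.QuadPart - t0.QuadPart)
                      / double(f.QuadPart) / runs;
    printf("%d-point transform: %.2f usec\n", N, usec);

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
/**********************************************/
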
On Jan 2, 1:25 am, "gebe" <g...@telkomsa.net> wrote:
> I know FFTW has a way to rate speed as MFLOPS, but I don't understand
> that measure
As clearly defined at the top of our benchmarks page, the "mflops"
number we report is simply a convenient rescaling of the speed
( = 5 N log2 N / [time in microseconds]):

http://www.fftw.org/speed/

(You can also download data files with the raw timing numbers if you
prefer, as described on our web page.)

Regards,
Steven G. Johnson
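In other words, a measured time converts to that "mflops" figure like
this (a trivial sketch of the rescaling just described; the 2048-point,
78 usec figure quoted earlier in the thread is used as the example
input):

/***************************************************/
// "mflops" = 5 N log2(N) / (time in microseconds), as defined above.
#include <cmath>
#include <cstdio>

int main()
{
    double N = 2048.0, time_usec = 78.0;
    double mflops = 5.0 * N * std::log2(N) / time_usec;
    printf("%.0f-point FFT in %.1f usec -> %.1f 'mflops'\n",
           N, time_usec, mflops);
    return 0;
}
/**********************************************/
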
On Jan 1, 10:06 pm, "gebe" <g...@telkomsa.net> wrote:
> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
> but for SPEED in specialised media.
> At the moment (using the best I could find) I am getting 36 usec for
> 1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
That's not bad at all for a first try, but you still have a fairly long
way to go.

Let me assume you are using single precision, which should be more than
sufficient for most applications involving such small transforms. In
that case, according to the benchmark numbers on our web site
(http://www.fftw.org/speed/), FFTW can do a 1024-point complex-data
transform in about 10 usec on a 2.4 GHz AMD Opteron 275 (32-bit mode),
and a 2048-point transform in about 23 usec; naively scaling by the
clock speed to 1.8 GHz gives 13 and 31 usec, respectively. On a newer
machine, a 3 GHz Core Duo in 64-bit mode, the times are a bit under
4 usec and 10 usec, respectively. And if your data are purely real, then
you can gain an additional factor of 1.5 to 2.

(It's fairly hard to get anywhere close to optimal performance on a
general-purpose CPU these days if you start with a textbook radix-2 FFT
algorithm and try to optimize it, if that is what you are doing.)

Regards,
Steven G. Johnson
On 3 Jan, 23:46, "Steven G. Johnson" <stev...@alum.mit.edu> wrote:
> On Jan 1, 10:06 pm, "gebe" <g...@telkomsa.net> wrote:
>
>> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
>> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
>> but for SPEED in specialised media.
>> At the moment (using the best I could find) I am getting 36 usec for
>> 1024 points on a 1800 MHz 32-bit AMD, and 78 usec for 2048 points.
>
> That's not bad at all for a first try, but you still have a fairly
> long way to go.
Has anybody tested what can be gained by switching to C++-style[*]
template programming (as opposed to regular code)? I gained a factor of
2 on a naive Chebyshev polynomial computation by switching from regular
code to template programming, using the same naive algorithm in both
cases.

Rune

[*] As far as I know, template programming is only available in C++,
but I wouldn't be surprised if it has become available elsewhere.
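For illustration, a minimal sketch of the kind of template approach Rune
describes (not his actual code): the Chebyshev degree is fixed at
compile time, so the compiler can inline and partially evaluate the
whole recurrence, in contrast to the ordinary run-time version.

/***************************************************/
// T_n(x) = 2x T_{n-1}(x) - T_{n-2}(x), with n as a template parameter.
// Illustrative only; not the code that produced the factor-of-2 figure.
#include <cstdio>

template <int N>
struct Chebyshev {
    static double eval(double x) {
        return 2.0 * x * Chebyshev<N - 1>::eval(x) - Chebyshev<N - 2>::eval(x);
    }
};
template <> struct Chebyshev<1> { static double eval(double x) { return x; } };
template <> struct Chebyshev<0> { static double eval(double)   { return 1.0; } };

// Equivalent "regular" run-time version, for comparison.
double chebyshev_rt(int n, double x)
{
    if (n == 0) return 1.0;
    if (n == 1) return x;
    return 2.0 * x * chebyshev_rt(n - 1, x) - chebyshev_rt(n - 2, x);
}

int main()
{
    printf("T_8(0.3) = %f (templates) = %f (regular)\n",
           Chebyshev<8>::eval(0.3), chebyshev_rt(8, 0.3));
    return 0;
}
/**********************************************/
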

Rune Allnor wrote:

>>> I am trying to implement a fast 256- to 2048-point (powers of two) FFT
>>> algorithm in ASM (as a DLL), not to compete with FFTW for versatility,
>>> but for SPEED in specialised media.
> Has anybody tested what can be gained by switching to C++-style[*]
> template programming (as opposed to regular code)?
Here is the real-input FFT code using <vector> and <complex>. It is
somewhat faster than the direct implementation.

Disclaimer: the code was submitted by a third party for my
consideration. I don't know who the author is.

/***************************************************/
#ifndef _FFT_H

#include <vector>
#include <math.h>
#include "complex.h"   // third-party Complex class with .r/.i members and,
                       // presumably, a two-argument exp(magnitude, phase)

using namespace std;

// FFT of a real-valued input vector.
// Returns 0 on success, 1 if the length is not a power of two.
int fft_bah(vector<double>* ps, vector<Complex>* pS)
{
    int N = ps->size();

    // Check that N is a power of two.
    int Z = N;
    while (Z > 1) {
        int Z1 = Z >> 1;
        float Z2 = (float)(Z) / 2;
        if (float(Z1) == Z2) { Z = Z1; }
        else { return 1; }
    }

    pS->clear();
    Complex c;
    vector<Complex> St;
    St.clear();

    // Number of stages = log2(N).
    int step = 0, N1;
    for (N1 = N; N1 > 1;) { step = step + 1; N1 = N1 >> 1; }

    // Copy the real input into a complex working buffer.
    int k = 1;
    for (int i = 0; i < (int)ps->size(); i++) {
        c.r = ps->at(i);
        c.i = 0.0;
        St.push_back(c);
    }

    // Butterfly stages.
    for (int z = 0; z < step; z++) {
        k = k * 2;
        for (int j = 0; j < k; j++) {
            if (j % 2 == 0)
                for (int n = 0; n < N / 2; n++) {
                    c = St.at(n + N * j / 2) + St.at(N / 2 + n + N * j / 2);
                    pS->push_back(c);
                }
            else
                for (int n = 0; n < N / 2; n++) {
                    c = (St.at(N * (j - 1) / 2 + n) - St.at(N / 2 + n + N * (j - 1) / 2))
                        * exp(1.0, -2 * M_PI * n / N);
                    pS->push_back(c);
                }
        }
        N = N / 2;
        St.clear();
        for (int i = 0; i < (int)pS->size(); i++) { c = pS->at(i); St.push_back(c); }
        pS->clear();
    }

    // Bit-reversal reordering of the result.
    int z;
    N = St.size();
    for (int i = 0; i < N; i++) {
        int j = 0;
        int n1 = 0;
        int m = 1;
        j = i;
        for (int k1 = N; k1 > 1;) {
            m = m * 2;
            z = int(j % (k1 / 2));
            n1 = n1 + (j - z) * m / k1;
            k1 = k1 / 2;
            j = z;
        }
        c = St.at(n1);
        pS->push_back(c);
    }
    return 0;
}

#define _FFT_H
#endif
/**********************************************/

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com
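A usage sketch for the routine above, under the assumption that the
poster's third-party "complex.h" exposes a Complex type with .r and .i
members as used in the listing (the test signal is arbitrary):

/***************************************************/
#include <stdio.h>
#include <vector>
#include "complex.h"   // same assumed third-party header as above

extern int fft_bah(std::vector<double>* ps, std::vector<Complex>* pS);

int main()
{
    const int N = 8;
    std::vector<double> x(N);
    for (int n = 0; n < N; ++n)
        x[n] = (n < N / 2) ? 1.0 : 0.0;      // simple rectangular pulse

    std::vector<Complex> X;
    if (fft_bah(&x, &X) != 0) {
        printf("length is not a power of two\n");
        return 1;
    }
    for (int k = 0; k < N; ++k)
        printf("X[%d] = %g + %gi\n", k, X[k].r, X[k].i);
    return 0;
}
/**********************************************/
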
On Jan 4, 10:58 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> Has anybody tested what can be gained by switching to C++-style[*]
> template programming (as opposed to regular code)?
What do you suggest using the templates for in the case of an FFT?

For type polymorphism, it's a coding convenience rather than a
performance issue. The complex type is also more a question of
convenience.

For STL container templates, the vendor-optimized parts aren't too
useful here. The fastest FFT algorithms don't really consist of a
sequence of BLAS-like operations or similar collective functions where
the optimized code in STL vector templates will help you
(performance-wise), as far as I can tell; for random access and
operations on individual elements, it is going to be equivalent to
ordinary arrays/pointers.

For Blitz++ style metaprogramming, they have tried it for FFT
(http://www.oonumerics.org/blitz/examples/fft.html). Basically, you can
use templates to write a code generator for hard-coded FFTs of small
sizes. (Generating a hard-coded FFT for sizes > 128 or so is impractical
due to instruction-cache issues.) However, there's much more to
generating fast hard-coded FFTs than just partial evaluation of a
radix-2 FFT, which is what Blitz++ did IIRC; one has to worry about
code-scheduling issues to minimize register spills (the compiler can't
do it for you because it is NP-hard in general, but there are
specialized solutions that work well for FFTs), one would like to
simplify operations involving trivial powers of roots of unity (which
give 1, i, [1+i]/sqrt(2), etcetera, that lead to simplified complex
multiplications), etcetera... see our papers on FFTW's code generator.

Of course, in principle this could all be done with template
metaprogramming, since it is Turing-complete (if I recall correctly),
but it doesn't seem like the right language for this sort of thing. In
FFTW, we wrote our code generator in Caml, which is more suited for
writing compiler-like programs (built-in support for pattern matching,
symbolic-expression datatypes, etc). Anyway, all of this is insufficient
for FFTs > 128 or so, as I mentioned; at best, it gives you building
blocks for larger FFTs, as in FFTW.

Regards,
Steven G. Johnson
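To make the "partial evaluation of a radix-2 FFT" idea concrete, here is
a minimal compile-time sketch in the Blitz++ spirit (not FFTW's
generator, and without any of the scheduling or twiddle-factor
simplifications mentioned above): the transform size is a template
parameter, so the compiler can unroll the whole recursion for a small,
fixed, power-of-two N.

/***************************************************/
#include <complex>
#include <cstddef>
#include <cstdio>

template <std::size_t N>
struct StaticFFT {
    // Out-of-place decimation-in-time step: `in` is read with stride s,
    // and the N-point result is written to out[0..N).
    static void apply(const std::complex<double>* in,
                      std::complex<double>* out, std::size_t s)
    {
        StaticFFT<N / 2>::apply(in,     out,         2 * s);  // even samples
        StaticFFT<N / 2>::apply(in + s, out + N / 2, 2 * s);  // odd samples
        const double pi = 3.14159265358979323846;
        for (std::size_t k = 0; k < N / 2; ++k) {
            std::complex<double> w = std::polar(1.0, -2.0 * pi * double(k) / double(N));
            std::complex<double> e = out[k];
            std::complex<double> o = w * out[k + N / 2];
            out[k]         = e + o;      // butterfly
            out[k + N / 2] = e - o;
        }
    }
};

template <>
struct StaticFFT<1> {                    // 1-point DFT is the identity
    static void apply(const std::complex<double>* in,
                      std::complex<double>* out, std::size_t)
    {
        out[0] = in[0];
    }
};

int main()
{
    const std::size_t N = 8;
    std::complex<double> x[N], X[N];
    for (std::size_t n = 0; n < N; ++n)
        x[n] = (n < N / 2) ? 1.0 : 0.0;  // arbitrary test signal

    StaticFFT<N>::apply(x, X, 1);
    for (std::size_t k = 0; k < N; ++k)
        std::printf("X[%zu] = %+.3f %+.3fi\n", k, X[k].real(), X[k].imag());
    return 0;
}
/**********************************************/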