DSPRelated.com
Forums

Optimal FFT algorithm for hardware implementation ?

Started by ARH May 2, 2008
Hi

I am looking for an optimal FFT algorithm with low computation
complexity and significant simplicity for hardware implementation.
there are several FFT algorithm like Cooley-Tukey or Prime-factor or
Split-radix ... each have several version with different complexity.

would you mind to help me for finding better algorithm ?

>Hi > >I am looking for an optimal FFT algorithm with low computation >complexity and significant simplicity for hardware implementation. >there are several FFT algorithm like Cooley-Tukey or Prime-factor or >Split-radix ... each have several version with different complexity. > >would you mind to help me for finding better algorithm ? >
The short answer is that "optimal" depends upon your platform for implementation. Go to www.fftw.org. The FFTW code is adaptable and often provides the fastest implementation (on average) for a variety of sizes. Before execution it tests a variety of potential plans and determines the fastest and uses that. In my experience, what often seems to be the fastest is not due to memory/cache issues as well as instruction scheduling, etc. Mark
On May 2, 9:04 am, "markt" <tak...@pericle.com> wrote:
> >Hi > > >I am looking for an optimal FFT algorithm with low computation > >complexity and significant simplicity for hardware implementation. > >there are several FFT algorithm like Cooley-Tukey or Prime-factor or > >Split-radix ... each have several version with different complexity. > > >would you mind to help me for finding better algorithm ? > > The short answer is that "optimal" depends upon your platform for > implementation. > > Go towww.fftw.org. The FFTW code is adaptable and often provides the > fastest implementation (on average) for a variety of sizes. Before > execution it tests a variety of potential plans and determines the fastest > and uses that. In my experience, what often seems to be the fastest is not > due to memory/cache issues as well as instruction scheduling, etc. > > Mark
Hi Mark Thank you for reply, I didn't speak about target hardware platform because I want to design my own FFT processor, I don't want to use FFT in my application because FFT is my application :) I found intresting C implementions in FFTW, did you think they have required computation and implementation simlicity for my work ?
The FFTW source code and design will be a good source but in general
for new hardware design it may not provide too much insight because it
was designed to be an optimal FFT for current processors (ie Intel,
etc).  There is an interesting forward in IEEE SP Magazine from the
last couple months that touches on the subject of the FFTW.

If you are experimenting with new hardware design you have much more
flexibility in your data flow, computation, etc.  You will have to
balance resources, size, power consumption, etc.  If you do some
searches you will find interesting implementation based on CORDIC,
Quaterion blocks, and others.

Good Luck.
The FFTW source code and design will be a good source but in general
for new hardware design it may not provide too much insight because
it
was designed to be an optimal FFT for current processors (ie Intel,
etc).  There is an interesting forward in IEEE SP Magazine from the
last couple months that touches on the subject of the FFTW.

If you are experimenting with new hardware design you have much more
flexibility in your data flow, computation, etc.  You will have to
balance resources, size, power consumption, etc.  If you do some
searches you will find interesting implementation based on CORDIC,
Quaterion blocks, and others.

Good Luck.
>I found intresting C implementions in FFTW, did you think they have >required computation and implementation simlicity for my work ?
Hard to say. As Chris noted, if you're developing your own hardware processor it's a different game. FFTW optimizes itself based on what it finds to be the fastest for your specific platform. It has a variety of builtins that are "best" for most of the mainstream processors. I particularly worked on porting the FFTW code to work with the SIMD processor in a MIPS-based system on a chip (Broadcom's quad-core 1480). If you don't have a specific architecture, then FFTW isn't really applicable in that sense. FFTW has just about every major algorithm in its code-base, but it is written through OCAML (as I recall), and difficult to decipher for a port. When I ported the SIMD code to the 1480, I had to actually take the PowerPC code and substitute all of my assembly primitives, with some modification in the routines, to get that to work. Of course, after all that work Code Sourcery comes out with a public port NOW, rather than 18 months ago. ;) Mark
On May 5, 8:54&#4294967295;am, "markt" <tak...@pericle.com> wrote:
> >I found intresting C implementions in FFTW, did you think they have > >required computation and implementation simlicity for my work ? > > Hard to say. &#4294967295;As Chris noted, if you're developing your own hardware > processor it's a different game. &#4294967295;FFTW optimizes itself based on what it > finds to be the fastest for your specific platform. &#4294967295;It has a variety of > builtins that are "best" for most of the mainstream processors. &#4294967295;I > particularly worked on porting the FFTW code to work with the SIMD > processor in a MIPS-based system on a chip (Broadcom's quad-core 1480). &#4294967295;If > you don't have a specific architecture, then FFTW isn't really applicable > in that sense. &#4294967295;FFTW has just about every major algorithm in its code-base, > but it is written through OCAML (as I recall), and difficult to decipher > for a port. &#4294967295;When I ported the SIMD code to the 1480, I had to actually > take the PowerPC code and substitute all of my assembly primitives, with > some modification in the routines, to get that to work. &#4294967295;Of course, after > all that work Code Sourcery comes out with a public port NOW, rather than > 18 months ago. ;) > > Mark
Hi Mark Thanks for your reply, seems you done hard but interesting work, congratulation! ARH
Hi Chris

Sorry for my late reply, I am new to FFT and DSP algorithms, so I try
to found more about CORDIC algorithm in wiki and I found that this
algorithm developed to calculate hyperbolic and trigonometric
functions no hardware multiplier is available but I have no limitation
in my design so I want to use multiplier for upgrading matrix
multiplication performance. Regarding to Quaternion I found no related
text in the net, I think it would be an extension of complex
numbers !

I think it would be better for me to forget optimize FFT algorithm,
because these algorithms have complexity which aren&#4294967295;t suitable for
starting a design. So I prefer to start with a very simple FFT
algorithm, would you mind help me to figure out this from jungle of
information in the net ?


Regards
ARH

On May 4, 4:21 pm, chris <chris.fel...@gmail.com> wrote:
> The FFTW source code and design will be a good source but in general > for new hardware design it may not provide too much insight because > it > was designed to be an optimal FFT for current processors (ie Intel, > etc). There is an interesting forward in IEEE SP Magazine from the > last couple months that touches on the subject of the FFTW. > > If you are experimenting with new hardware design you have much more > flexibility in your data flow, computation, etc. You will have to > balance resources, size, power consumption, etc. If you do some > searches you will find interesting implementation based on CORDIC, > Quaterion blocks, and others. > > Good Luck.
>I think it would be better for me to forget optimize FFT algorithm, >because these algorithms have complexity which aren=92t suitable for >starting a design. So I prefer to start with a very simple FFT >algorithm, would you mind help me to figure out this from jungle of >information in the net ?
There's another thread in the first page or two here that gets in to some of the concepts involved with optimizing an FFT. Steve Johnson, one of the authors of FFTW, has made several posts in this regard. The term optimal, ultimately, needs to be referred to some parameter such as FLOPS or speed. Fewer FLOPS does not always equal faster speed for a variety of reasons. The simplest FFT design (sort of) is a standard radix-2 decomposition assuming you have a power of 2 data set. Every power of 2 in data size requires another column of radix-2 butterflies, but the butterflies are easy to construct (the "hard" part is determining the mapping from column to column). Mark
Thanks again, Chris also sent a useful mail for me which mentioned
similar solution, for gathering all of your respect informations
together I was paste his email here:

Chris:
" If you are not too worried about hardware resources and want a
fairly straight forward implementation, use the cooley-tukey method.
This is the original FFT algorithm.  It is usually implemented in code
with 2 for loops.  For hardware you can unroll the loops and have it
fully parallel.

For an FFT, depending on the radix, there is usually two parts.  FIrst
a 2-point DFT (again depending on radix) then the basic butterfly
algebra.  If you start with a simple 8 point FFT, you can implement 4
parallel 2-point FFT and then the butterfly.

The book "DSP Integrated Circuits" by Lars Wanhammar, has a case study
at the end of each chapter, piece by piece, developing different FFT
cores.  I don't know of any particular online resources, the online
Numerical Recipes (http://www.nrbook.com/nr3/) and DSPGuide (http://
www.dspguide.com/) will be good general information (not hardware
specific) online resources.

Hope that helps."