
FFT tricks for optimized multichannel processing

Started by rsdio January 22, 2011
On Jan 31, 2:58 pm, dbd wrote:
> Looking at the DSP55x architecture, the right answer (well, efficient
> implementation) for the OP probably relates more to DSP architectural
> features than workstation benchmarks. The DSP55x chips have a hardware
> fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
> of the directly programmed CPU on the chip, so any scheme that leads
> to the use of larger transforms than supported by the HWAFFT is likely
> to be less efficient.
>
> Dale B. Dalrymple
Only v3 of the DSP C55x has hardware FFT acceleration. I'm working with
v2, which does not have hardware acceleration. But it can perform two
multiplications in parallel in a single cycle, so the usual concerns of
workstation benchmarks are not relevant: multiplication is no more
expensive than addition. Another factor is that bit-reversed addressing
is supported by the processor, so it effectively has zero cost.
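To illustrate why the multiply count matters less on such hardware,
here is a minimal sketch in plain C (not C55x code; the function names
are mine). The 3-multiply form of a complex multiply only pays off when
a multiply costs more than an add, which is not the case on a dual-MAC
part:

#include <complex.h>

/* Standard complex multiply: 4 multiplies, 2 adds. On dual-MAC
   hardware the multiplies pair up, so this form is typically as
   fast as (or faster than) the 3-multiply version below. */
static double complex cmul4(double complex a, double complex b)
{
    double ar = creal(a), ai = cimag(a);
    double br = creal(b), bi = cimag(b);
    return (ar*br - ai*bi) + (ar*bi + ai*br) * I;
}

/* 3-multiply, 5-add form. Worth it only where a multiply is more
   expensive than an add. */
static double complex cmul3(double complex a, double complex b)
{
    double ar = creal(a), ai = cimag(a);
    double br = creal(b), bi = cimag(b);
    double k1 = br * (ar + ai);
    double k2 = ar * (bi - br);
    double k3 = ai * (br + bi);
    return (k1 - k3) + (k1 + k2) * I;
}

Brian Willoughby
Sound Consulting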
On Jan 31, 2:58 pm, dbd wrote:

> So are you relating workstation benchmarks or benchmarks from a DSP
> architecture?
Neither. I'm simply saying that if there are two algorithms that can be
implemented in different ways, then it makes sense to use the more
'efficient' one ('efficiency' meaning different things, depending on
the design constraints).

Consider the '2N real with an N complex' technique. It's been around
since 1967, but there are different implementations. One of them is:

http://focus.ti.com.cn/cn/lit/an/spra291/spra291.pdf

I wouldn't doubt that quite a few people have made use of it, because
it's a canned routine from a major vendor. Many might consider it
'efficient'. Now I don't believe that I've ever met the author of the
above, but I'm sure he's a very smart man, and I appreciate his effort
and expertise. But I have a different version of the above that uses
half the multiplies, slightly fewer adds, and 1/8 the size of the
look-up tables (and I presume that the N complex FFT would be the same
for both algorithms).

(kevin wrote):
>> As for me, I usually stick with radix 8 to minimize multiplies, and
>> use look-up tables for twiddle factors and bit reverse. I can do
>> other things to be more efficient, but it usually isn't worth it to
>> improve something by 5 or 10% if it takes twice (or more) as long to
>> write and test it.
(Dale wrote):
> That's a common choice for people implementing benchmarks at their
> desk on a workstation. People implementing high volume embedded
> systems look at that tradeoff differently.
I programmed my new version of the '2N real w/ N complex' on a desktop
while working on a hardware design, and I sketched it out with paper
and pencil first. I'm not sure that 'embedded' or 'high volume' or any
other adjective guarantees an efficient result.
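For readers who haven't seen the technique, here is a minimal C sketch
of the textbook form of the split (not my reduced-multiply variant).
The naive DFT is only a stand-in for whatever complex FFT routine you
actually use:

#include <complex.h>
#include <math.h>

#define N 8    /* complex FFT size; the real input has 2N points */

/* Stand-in N-point complex DFT (O(N^2), demonstration only). */
static void fft_N(const double complex *in, double complex *out)
{
    for (int k = 0; k < N; k++) {
        out[k] = 0;
        for (int n = 0; n < N; n++)
            out[k] += in[n] * cexp(-2.0 * I * M_PI * k * n / N);
    }
}

/* DFT bins X[0..N] of 2N real samples via one N-point complex FFT.
   The remaining bins follow from symmetry: X[2N-k] = conj(X[k]). */
static void rfft_2N(const double x[2 * N], double complex X[N + 1])
{
    double complex z[N], Z[N];

    /* Pack even samples into the real part, odd into the imaginary. */
    for (int n = 0; n < N; n++)
        z[n] = x[2 * n] + I * x[2 * n + 1];

    fft_N(z, Z);

    /* Post-processing: separate the even/odd spectra, then combine. */
    for (int k = 0; k < N; k++) {
        int km = (N - k) % N;
        double complex Fe =  0.5     * (Z[k] + conj(Z[km]));
        double complex Fo = -0.5 * I * (Z[k] - conj(Z[km]));
        X[k] = Fe + cexp(-I * M_PI * k / N) * Fo;  /* twiddle W_2N^k */
    }
    X[N] = creal(Z[0]) - cimag(Z[0]);  /* Fe[0] - Fo[0], purely real */
}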
> Looking at the DSP55x architecture, the right answer (well, efficient
> implementation) for the OP probably relates more to DSP architectural
> features than workstation benchmarks. The DSP55x chips have a hardware
> fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
> of the directly programmed CPU on the chip, so any scheme that leads
> to the use of larger transforms than supported by the HWAFFT is likely
> to be less efficient.
But you can slice and dice an FFT algorithm to fit the size of the
available resources. As usual, the resources constrain the design space
and influence design decisions. No matter if it's custom hardware, gate
array, FPGA, DSP, ... microcontroller, abacus, etc., I'd rather start
with an efficient algorithm that fits within the design constraints. It
can also happen that resources are so limited that one cannot use a
more 'efficient' algorithm (e.g., a look-up table approach for a
microcontroller with very limited memory).
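As an example of the slicing, here is a minimal C sketch of the classic
four-step decomposition of an N = N1*N2 point DFT into small
transforms. dft() is a placeholder for whatever fixed-size kernel the
resources give you (an HWAFFT call, an unrolled radix-8 stage, etc.):

#include <complex.h>
#include <math.h>

#define N1 8
#define N2 8   /* total transform size N = N1 * N2 = 64 */

/* Placeholder small DFT with input stride (O(n^2), demo only). */
static void dft(const double complex *in, double complex *out,
                int n, int stride)
{
    for (int k = 0; k < n; k++) {
        out[k] = 0;
        for (int j = 0; j < n; j++)
            out[k] += in[j * stride] * cexp(-2.0 * I * M_PI * k * j / n);
    }
}

/* N-point DFT built from size-N1 and size-N2 pieces.  The input is
   viewed as an N1 x N2 row-major matrix; X[k1 + N1*k2] is output. */
static void fft_4step(const double complex x[N1 * N2],
                      double complex X[N1 * N2])
{
    double complex t[N1 * N2], col[N1], row[N2];

    /* Size-N1 DFT down each column (stride N2), then twiddle. */
    for (int n2 = 0; n2 < N2; n2++) {
        dft(&x[n2], col, N1, N2);
        for (int k1 = 0; k1 < N1; k1++)
            t[k1 * N2 + n2] = col[k1]
                * cexp(-2.0 * I * M_PI * k1 * n2 / (N1 * N2));
    }
    /* Size-N2 DFT along each row; reorder into the final output. */
    for (int k1 = 0; k1 < N1; k1++) {
        dft(&t[k1 * N2], row, N2, 1);
        for (int k2 = 0; k2 < N2; k2++)
            X[k1 + N1 * k2] = row[k2];
    }
}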
> On the other hand, if the OP actually has a homework problem to pack
> multiple channels of FFT into a single FFT call, we haven't helped the
> OP (or Rick) much.
Well, I thought the OP could use the 'replicate the inner butterfly
code' and 'two N real with an N complex' ideas. It's worked for me with
image processing. But it can be tricky to do.
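For the simplest multichannel case, two real channels sharing one
complex FFT, a minimal C sketch looks like this (fft_N() again just
stands in for your own complex FFT routine):

#include <complex.h>
#include <math.h>

#define N 16   /* FFT size; kept small since the stand-in is O(N^2) */

/* Stand-in N-point complex DFT; substitute a real FFT here. */
static void fft_N(const double complex *in, double complex *out)
{
    for (int k = 0; k < N; k++) {
        out[k] = 0;
        for (int n = 0; n < N; n++)
            out[k] += in[n] * cexp(-2.0 * I * M_PI * k * n / N);
    }
}

/* Transform two real channels a[] and b[] with one complex FFT:
   pack a into the real part, b into the imaginary part, transform
   once, then split the two spectra using conjugate symmetry. */
static void two_channel_fft(const double a[N], const double b[N],
                            double complex A[N], double complex B[N])
{
    double complex z[N], Z[N];

    for (int n = 0; n < N; n++)
        z[n] = a[n] + I * b[n];

    fft_N(z, Z);

    for (int k = 0; k < N; k++) {
        int km = (N - k) % N;
        A[k] =  0.5     * (Z[k] + conj(Z[km]));  /* spectrum of a */
        B[k] = -0.5 * I * (Z[k] - conj(Z[km]));  /* spectrum of b */
    }
}

Kevin McGee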
On Jan 31, 11:48 pm, kevin <kevinjmc...@netscape.net> wrote:
> On Jan 31, 2:58 pm, dbd wrote:
>
> > So are you relating workstation benchmarks or benchmarks from a DSP
> > architecture?
>
> Neither. I'm simply saying that if there are two algorithms that can
> be implemented in different ways, then it makes sense to use the more
> 'efficient' one ('efficiency' meaning different things, depending on
> the design constraints).
> ...
You posted: On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net>
wrote:

#> I've timed both approaches and have found the single complex FFT plus
#> separation overhead to be faster than the two real approach,
#> especially for large N.
#> ...
#> I've also found similar results for the '2N real with an N
#> complex' (eg: 1024 real points with a 512 point complex FFT plus pre
#> and post processing). But the cross-over point where doing things
#> with the one complex FFT becomes faster is larger due to the
#> additional processing that this approach requires (I've substantially
#> reduced that processing by using a modified method, but it's not
#> included in my tutorial).
#> ...

I was asking to learn about the context of the timings you mentioned.
> >> As for me, I usually stick with radix 8 to minimize multiplies, and
> >> use look-up tables for twiddle factors and bit reverse. I can do
> >> other things to be more efficient, but it usually isn't worth it to
> >> improve something by 5 or 10% if it takes twice (or more) as long to
> >> write and test it.
>
> (Dale wrote):
>
> > That's a common choice for people implementing benchmarks at their
> > desk on a workstation. People implementing high volume embedded
> > systems look at that tradeoff differently.
> ...
> I'm not sure that 'embedded' or 'high volume' or any other adjective
> guarantees an efficient result.
> ...
By "common choice ... on a workstation" I was referring to "it isn't worth it to improve something by 5 or 10%". Embedded and high volume developers might go for the 5 or 10% targets. They might have to. Dale B. Dalrymple