
FFT tricks for optimized multichannel processing

Started by rsdio January 22, 2011
On Jan 31, 2:58 pm, dbd wrote:
> Looking at the DSP55x architecture, the right answer (well, efficient
> implementation) for the OP probably relates more to DSP architectural
> features than workstation benchmarks. The DSP55x chips have a hardware
> fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
> of the directly programmed CPU on the chip, so any scheme that leads
> to the use of larger transforms than supported by the HWAFFT is likely
> to be less efficient.
>
> Dale B. Dalrymple
Only v3 of the DSP C55x has hardware FFT acceleration. I'm working with
v2, which does not have hardware acceleration. But it can perform two
multiplications in parallel in a single cycle, so the usual concerns of
workstation benchmarks are not relevant: multiplication is no more
expensive than addition. Another factor is that bit-reversed addressing
is supported by the processor, so it effectively has zero cost.
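To illustrate why the multiply count matters less on such hardware,
here is a minimal sketch in plain C (not C55x code; the function names
are mine). The 3-multiply form of a complex multiply only pays off when
a multiply costs more than an add, which is not the case on a dual-MAC
part:

#include <complex.h>

/* Standard complex multiply: 4 multiplies, 2 adds. On dual-MAC
   hardware the multiplies pair up, so this form is typically as
   fast as (or faster than) the 3-multiply version below. */
static double complex cmul4(double complex a, double complex b)
{
    double ar = creal(a), ai = cimag(a);
    double br = creal(b), bi = cimag(b);
    return (ar*br - ai*bi) + (ar*bi + ai*br) * I;
}

/* 3-multiply, 5-add form. Worth it only where a multiply is more
   expensive than an add. */
static double complex cmul3(double complex a, double complex b)
{
    double ar = creal(a), ai = cimag(a);
    double br = creal(b), bi = cimag(b);
    double k1 = br * (ar + ai);
    double k2 = ar * (bi - br);
    double k3 = ai * (br + bi);
    return (k1 - k3) + (k1 + k2) * I;
}

Brian Willoughby
Sound Consulting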
On Jan 31, 2:58 pm, dbd wrote:

> So are you relating workstation benchmarks or benchmarks from a DSP
> architecture?
Neither. I'm simply saying that if there are two algorithms that can be
implemented in different ways, then it makes sense to use the more
'efficient' one ('efficiency' meaning different things, depending on
the design constraints).

Consider the '2N real with an N complex' technique. It's been around
since 1967, but there are different implementations. One of them is:

http://focus.ti.com.cn/cn/lit/an/spra291/spra291.pdf

I wouldn't doubt that quite a few people have made use of it, because
it's a canned routine from a major vendor. Many might consider it
'efficient'. Now I don't believe that I've ever met the author of the
above, but I'm sure he's a very smart man, and I appreciate his effort
and expertise. But I have a different version of the above that uses
half the multiplies, slightly fewer adds, and 1/8 the size of the
look-up tables (and I presume that the N complex FFT would be the same
for both algorithms).

(kevin wrote):
>> As for me, I usually stick with radix 8 to minimize multiplies, and
>> use look-up tables for twiddle factors and bit reverse. I can do
>> other things to be more efficient, but it usually isn't worth it to
>> improve something by 5 or 10% if it takes twice (or more) as long to
>> write and test it.
(Dale wrote):
> That's a common choice for people implementing benchmarks at their
> desk on a workstation. People implementing high volume embedded
> systems look at that tradeoff differently.
I programmed my new version of the '2N real w/ N complex' on a desktop
while working on a hardware design, and I sketched it out with paper
and pencil first. I'm not sure that 'embedded' or 'high volume' or any
other adjective guarantees an efficient result.
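For readers who haven't seen the technique, here is a minimal C sketch
of the textbook form of the split (not my reduced-multiply variant).
The naive DFT is only a stand-in for whatever complex FFT routine you
actually use:

#include <complex.h>
#include <math.h>

#define N 8    /* complex FFT size; the real input has 2N points */

/* Stand-in N-point complex DFT (O(N^2), demonstration only). */
static void fft_N(const double complex *in, double complex *out)
{
    for (int k = 0; k < N; k++) {
        out[k] = 0;
        for (int n = 0; n < N; n++)
            out[k] += in[n] * cexp(-2.0 * I * M_PI * k * n / N);
    }
}

/* DFT bins X[0..N] of 2N real samples via one N-point complex FFT.
   The remaining bins follow from symmetry: X[2N-k] = conj(X[k]). */
static void rfft_2N(const double x[2 * N], double complex X[N + 1])
{
    double complex z[N], Z[N];

    /* Pack even samples into the real part, odd into the imaginary. */
    for (int n = 0; n < N; n++)
        z[n] = x[2 * n] + I * x[2 * n + 1];

    fft_N(z, Z);

    /* Post-processing: separate the even/odd spectra, then combine. */
    for (int k = 0; k < N; k++) {
        int km = (N - k) % N;
        double complex Fe =  0.5     * (Z[k] + conj(Z[km]));
        double complex Fo = -0.5 * I * (Z[k] - conj(Z[km]));
        X[k] = Fe + cexp(-I * M_PI * k / N) * Fo;  /* twiddle W_2N^k */
    }
    X[N] = creal(Z[0]) - cimag(Z[0]);  /* Fe[0] - Fo[0], purely real */
}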
> Looking at the DSP55x architecture, the right answer (well, efficient
> implementation) for the OP probably relates more to DSP architectural
> features than workstation benchmarks. The DSP55x chips have a hardware
> fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
> of the directly programmed CPU on the chip, so any scheme that leads
> to the use of larger transforms than supported by the HWAFFT is likely
> to be less efficient.
But you can slice and dice an FFT algorithm to fit the size of the
available resources. As usual, the resources constrain the design space
and influence design decisions. No matter if it's custom hardware, gate
array, FPGA, DSP, ... microcontroller, abacus, etc., I'd rather start
with an efficient algorithm that fits within the design constraints. It
can also happen that resources are so limited that one cannot use a
more 'efficient' algorithm (e.g., a look-up table approach for a
microcontroller with very limited memory).
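As an example of the slicing, here is a minimal C sketch of the classic
four-step decomposition of an N = N1*N2 point DFT into small
transforms. dft() is a placeholder for whatever fixed-size kernel the
resources give you (an HWAFFT call, an unrolled radix-8 stage, etc.):

#include <complex.h>
#include <math.h>

#define N1 8
#define N2 8   /* total transform size N = N1 * N2 = 64 */

/* Placeholder small DFT with input stride (O(n^2), demo only). */
static void dft(const double complex *in, double complex *out,
                int n, int stride)
{
    for (int k = 0; k < n; k++) {
        out[k] = 0;
        for (int j = 0; j < n; j++)
            out[k] += in[j * stride] * cexp(-2.0 * I * M_PI * k * j / n);
    }
}

/* N-point DFT built from size-N1 and size-N2 pieces.  The input is
   viewed as an N1 x N2 row-major matrix; X[k1 + N1*k2] is output. */
static void fft_4step(const double complex x[N1 * N2],
                      double complex X[N1 * N2])
{
    double complex t[N1 * N2], col[N1], row[N2];

    /* Size-N1 DFT down each column (stride N2), then twiddle. */
    for (int n2 = 0; n2 < N2; n2++) {
        dft(&x[n2], col, N1, N2);
        for (int k1 = 0; k1 < N1; k1++)
            t[k1 * N2 + n2] = col[k1]
                * cexp(-2.0 * I * M_PI * k1 * n2 / (N1 * N2));
    }
    /* Size-N2 DFT along each row; reorder into the final output. */
    for (int k1 = 0; k1 < N1; k1++) {
        dft(&t[k1 * N2], row, N2, 1);
        for (int k2 = 0; k2 < N2; k2++)
            X[k1 + N1 * k2] = row[k2];
    }
}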
> On the other hand, if the OP actually has a homework problem to pack
> multiple channels of FFT into a single FFT call, we haven't helped the
> OP (or Rick) much.
Well, I thought the OP could use the 'replicate the inner butterfly
code' and 'two N real with an N complex' ideas. It's worked for me with
image processing. But it can be tricky to do.
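For the simplest multichannel case, two real channels sharing one
complex FFT, a minimal C sketch looks like this (fft_N() again just
stands in for your own complex FFT routine):

#include <complex.h>
#include <math.h>

#define N 16   /* FFT size; kept small since the stand-in is O(N^2) */

/* Stand-in N-point complex DFT; substitute a real FFT here. */
static void fft_N(const double complex *in, double complex *out)
{
    for (int k = 0; k < N; k++) {
        out[k] = 0;
        for (int n = 0; n < N; n++)
            out[k] += in[n] * cexp(-2.0 * I * M_PI * k * n / N);
    }
}

/* Transform two real channels a[] and b[] with one complex FFT:
   pack a into the real part, b into the imaginary part, transform
   once, then split the two spectra using conjugate symmetry. */
static void two_channel_fft(const double a[N], const double b[N],
                            double complex A[N], double complex B[N])
{
    double complex z[N], Z[N];

    for (int n = 0; n < N; n++)
        z[n] = a[n] + I * b[n];

    fft_N(z, Z);

    for (int k = 0; k < N; k++) {
        int km = (N - k) % N;
        A[k] =  0.5     * (Z[k] + conj(Z[km]));  /* spectrum of a */
        B[k] = -0.5 * I * (Z[k] - conj(Z[km]));  /* spectrum of b */
    }
}

Kevin McGee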
On Jan 31, 11:48 pm, kevin <kevinjmc...@netscape.net> wrote:
> On Jan 31, 2:58 pm, dbd wrote:
>
> > So are you relating workstation benchmarks or benchmarks from a DSP
> > architecture?
>
> Neither. I'm simply saying that if there are two algorithms that can
> be implemented in different ways, then it makes sense to use the more
> 'efficient' one ('efficiency' meaning different things, depending on
> the design constraints).
> ...
You posted: On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net>
wrote:

#> I've timed both approaches and have found the single complex FFT plus
#> separation overhead to be faster than the two real approach,
#> especially for large N.
#> ...
#> I've also found similar results for the '2N real with an N
#> complex' (eg: 1024 real points with a 512 point complex FFT plus pre
#> and post processing). But the cross-over point where doing things
#> with the one complex FFT becomes faster is larger due to the
#> additional processing that this approach requires (I've substantially
#> reduced that processing by using a modified method, but it's not
#> included in my tutorial).
#> ...

I was asking to learn about the context of the timings you mentioned.
> >> As for me, I usually stick with radix 8 to minimize multiplies, and
> >> use look-up tables for twiddle factors and bit reverse. I can do
> >> other things to be more efficient, but it usually isn't worth it to
> >> improve something by 5 or 10% if it takes twice (or more) as long to
> >> write and test it.
>
> (Dale wrote):
>
> > That's a common choice for people implementing benchmarks at their
> > desk on a workstation. People implementing high volume embedded
> > systems look at that tradeoff differently.
> ...
> I'm not sure that 'embedded' or 'high volume' or any other adjective
> guarantees an efficient result.
> ...
By "common choice ... on a workstation" I was referring to "it isn't worth it to improve something by 5 or 10%". Embedded and high volume developers might go for the 5 or 10% targets. They might have to. Dale B. Dalrymple