
FFT tricks for optimized multichannel processing

Started by rsdio January 22, 2011
On Jan 28, 11:54 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu>
wrote:
> ...
>
> There is a common trick to do a real transform of length 2N as
> a complex transform of length N. You should also be able to do
> two real transforms of length N as one complex transform of length N.
As Kevin's tutorial points out, the common tricks are less efficient than performing a 2N point real input transform or two N point real input transforms. The tricks are efficient only when the only FFT implementations you have are complex and your data is real.
> It seems to me, then, that it shouldn't be hard to do M transforms
> of length N as a length M*N transform, at least for power of two M's.
The M channels -can- be assembled and disassembled to allow the use of an M*N point transform to achieve M transforms of N points. The assembly and disassembly are not free and the M*N point transform is less efficient so it isn't commonly done.
> Then add some work for putting together the longer transform
> and taking apart the result. It still might be faster than
> separate transforms.
> ...
The computational cost of the M channels of N point transforms is proportional to M*N*log(N). The computational cost of the M*N point transform is proportional to M*N*log(M*N). Even the big transform alone is less efficient.

Many DSP architectures are constructed with hierarchies of memory, in which the smaller, more tightly coupled memories are faster than the larger ones. Any time the increase in transform size forces the algorithm to perform more memory accesses from a slower memory level, the algorithm will slow down even more.

Dale B. Dalrymple
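For a rough feel of the gap Dale describes, a minimal sketch of the operation-count comparison (an editor's illustration, not from the thread; it assumes equal proportionality constants and ignores both the assemble/disassemble work and the memory-hierarchy effects):

    #include <math.h>
    #include <stdio.h>

    /* Compare ~M*N*log2(N) (M separate N-point FFTs) against
       ~M*N*log2(M*N) (one combined M*N-point FFT). */
    int main(void)
    {
        const double M = 8.0, N = 1024.0;          /* 8 channels, 1024 points each */
        double separate = M * N * log2(N);         /* 8*1024*10 = 81920  */
        double combined = M * N * log2(M * N);     /* 8*1024*13 = 106496 */
        printf("separate %.0f  combined %.0f  ratio %.2f\n",
               separate, combined, combined / separate);  /* ratio = 1.30 */
        return 0;
    }

For the OP's case (M = 8 channels), the combined transform does roughly 30% more butterfly-level work before any packing overhead is counted.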
rsdio <Sound_Consulting@n_o_s_p_a_m.sounds.wa.com> wrote:
(snip, I wrote)

>>It seems to me, then, that it shouldn't be hard to do M transforms
>>of length N as a length M*N transform, at least for power of two M's.
(snip)
> Glen, you've just repeated my original posting that started this
> discussion. It's the details of your last paragraph that I am searching
> for. I have a hunch that bit-reversed addressing can possibly handle the
> splicing, but I have not worked it out in detail yet. I am working with
> powers of two, because in my specific application I have 8 channels of
> independent data (at a common sample rate).
OK, the process for converting 2N real points to N complex points, transforming that, and then converting to the appropriate 2N point result is known. It should then be possible to convert N complex points to 2N real points.

I believe it also isn't hard to do two N point real transforms as one complex transform. If so, then you can convert two N point transforms to a 2N point transform, repeat as needed.
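For reference, a minimal sketch of the 2N-real-via-N-complex process glen describes (an editor's illustration, not glen's code; cfft is a hypothetical in-place N-point complex FFT, and only bins 0..N-1 are produced since the remaining bins follow from conjugate symmetry):

    #include <complex.h>
    #include <math.h>

    extern void cfft(double complex *z, int n);   /* hypothetical in-place complex FFT */

    /* x: 2N real samples in; X: bins 0..N-1 of the 2N-point transform out */
    void rfft_2N_via_cfft_N(const double *x, double complex *X, int N)
    {
        double complex z[N];                      /* C99 VLA, fine for a sketch */
        for (int n = 0; n < N; n++)               /* pack even/odd samples      */
            z[n] = x[2*n] + I * x[2*n + 1];

        cfft(z, N);                               /* one N-point complex FFT    */

        for (int k = 0; k < N; k++) {             /* post-process: split and recombine */
            double complex a  = z[k];
            double complex b  = conj(z[(N - k) % N]);
            double complex Fe = 0.5 * (a + b);           /* DFT of even samples */
            double complex Fo = -0.5 * I * (a - b);      /* DFT of odd samples  */
            X[k] = Fe + cexp(-I * M_PI * k / N) * Fo;    /* twiddle and combine */
        }
    }

The inverse direction glen asks about runs the same steps backward: split X into the even/odd spectra, inverse transform, and interleave.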
> Now I understand why everyone maintains the entire thread when quoting on
> comp.dsp - I'm using DSPRelated.com, which archives and collects together
> all postings in a thread. So, I did a bad job of attributing the replies
> when quoting - partly because DSPRelated.com does not carry through the
> attributions, and partly because I assumed most folks would have access to
> the whole thread anyway.
> So, Glen, I'm not complaining, but rather apologizing for my posting of a
> confusing quotation. I should probably dust off my copy of trn
-- glen
On Jan 29, 6:57 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> ...
>
> OK, the process for converting 2N real points to N complex points,
> transforming that, and then converting to the appropriate 2N
> point result is known. It should then be possible to convert
> N complex points to 2N real points.
>
> I believe it also isn't hard to do two N point real transforms
> as one complex transform. If so, then you can convert two N
> point transforms to a 2N point transform, repeat as needed.
> ...
>
> -- glen
Since each of these processes takes time domain samples as inputs and produces frequency domain coefficients as outputs, how is it possible to "repeat as necessary"?

Dale B. Dalrymple
On Jan 29, 1:53 pm, dbd <d...@ieee.org> wrote:

>As Kevin's tutorial points out, the common tricks are less efficient
>than performing a 2N point real input transform or two N point real
>input transforms. The tricks are efficient only when the only FFT
>implementations you have are complex and your data is real.
Actually, Dale, transforming two separate N point real sequences with a single complex in/out N point FFT could be faster than doing two separate N point real FFTs, especially for larger N. As mentioned in the tutorial, the two real FFTs for N points have to run faster than the one complex N point FFT plus the post-processing to separate the two transforms.

Although an N point FFT for real inputs can be faster than an N point FFT for complex inputs, there's always set-up and addressing overhead involved, too. Plus, for efficient processing, one would use look-up tables or twiddle recursions instead of sin/cos library calls. So there are twiddle tables or recursions or bit-reverse tables, etc. that one has to deal with.

I've timed both approaches and have found the single complex FFT plus separation overhead to be faster than the two-real approach, especially for large N. I've also found similar results for the '2N real with an N complex' method (eg: 1024 real points with a 512 point complex FFT plus pre- and post-processing). But the cross-over point where doing things with the one complex FFT becomes faster is larger due to the additional processing that this approach requires (I've substantially reduced that processing by using a modified method, but it's not included in my tutorial).

Of course, using global twiddle tables or multi-threading changes things, and it could tilt the decision as to what approach is more efficient substantially.

So yes, one or the other approach could be more or less efficient than the other, depending on what processing resources are available.

You also mentioned:
>The M channels -can- be assembled and disassembled to allow the use of
>an M*N point transform to achieve M transforms of N points. The
>assembly and disassembly are not free and the M*N point transform is
>less efficient so it isn't commonly done.
Strongly agree. The assembly/disassembly is definitely not free. And I doubt if the extra M FFT stages are really necessary (versus, say, M parallel N point implementations).

Kevin McGee
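For concreteness, a minimal sketch of the pack-and-separate method Kevin is timing above, i.e. two N-point real transforms from one N-point complex FFT (an editor's illustration, not Kevin's code; cfft is a hypothetical in-place complex FFT):

    #include <complex.h>

    extern void cfft(double complex *z, int n);   /* hypothetical in-place complex FFT */

    /* Transform two length-N real sequences x and y with one complex FFT,
       then separate the two spectra X and Y in a post-processing pass. */
    void two_real_ffts_via_one_cfft(const double *x, const double *y,
                                    double complex *X, double complex *Y,
                                    double complex *z, int N)
    {
        for (int n = 0; n < N; n++)
            z[n] = x[n] + I * y[n];               /* pack: x -> real, y -> imag */

        cfft(z, N);                               /* one N-point complex FFT    */

        for (int k = 0; k < N; k++) {             /* separation post-processing */
            double complex a = z[k];
            double complex b = conj(z[(N - k) % N]);
            X[k] = 0.5 * (a + b);                 /* spectrum of x */
            Y[k] = -0.5 * I * (a - b);            /* spectrum of y */
        }
    }

The separation pass is only O(N), so its relative cost shrinks as N grows, which is consistent with the observation that the single complex FFT wins especially for large N.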
On Jan 28, 7:56 pm, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:

>In my specific application, I have a hunch that it would not be possible to
>calculate multiple transforms at the same time. I'm guessing that since
>there isn't a readily available radix-8 butterfly, then the processor
>probably has too few registers to calculate multiple transforms in SIMD
>fashion. I could be way off, though, which is why I am bringing this up on
>comp.dsp
Radix has little to do with it - higher radix has to do with the lower number of complex multiplies in the FFT. As for your other concerns, let me clarify that there are really two distinct techniques:

1) Two different N point real sequences: (xr[N], xi[N] = 0) and (yr[N], yi[N] = 0) transformed simultaneously with a complex in/out N point FFT. A small amount of post-processing is required to separate them.

2) One 2N point real sequence (eg: xr[128], xi[128] = 0) transformed using one 64 point complex in/out FFT plus pre-processing (ie: odd/even split) and post-processing (ie: separate even/odd transforms; recombine to get 2N).

For your purposes (multiple channels of real data), the first technique is the one to use.

As for processing more than 2 channels (say, 4 or 6 or 10), I suppose that it might be possible in a single, larger FFT, but I have serious doubts as to whether that would be an efficient way of doing things. In fact, it could be grossly inefficient.

When I referred to doing more channels in parallel, this is what I meant. To keep things simple, let's say that there's no large memory space (so we'll have to use a recursion for our twiddle calculations), and also no multi-threading, etc. A typical recursive FFT might look like the 3 loop structure below:

    do ( some addressing )
      do ( more addressing )
        do ( butterfly calculations )
          ......
        } twiddle recursion stuff
      } more twiddle recursion stuff
    }

To make the software parallel so that it can operate on multiple channels, why not replicate the butterfly calculation part as many times as you need? You could still use the 'two N point real with an N point complex' method.

In other words, for processing 8 channels of real data, make channels 1 and 2 the real and imaginary parts of a complex input. So too with real channels 3 and 4, etc. Then you could program an N point complex in/out FFT, and repeat the butterfly code 4 times. After the FFT, you'd have 4 complex results - the first one contains channels 1 and 2 (which will be separated by post-processing). And so on.

The important thing is that you're not doing the addressing and twiddle recursion overhead for each channel. You do it only once. I've done this kind of thing before for image processing, and it improves things. But it might be difficult to do if you have to work with existing code that someone else wrote.

Kevin McGee
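A minimal sketch of that butterfly-replication idea (an editor's illustration, not Kevin's code; it uses a radix-2 loop nest rather than his radix 8, assumes each buffer has already been permuted into bit-reversed order, and omits the per-pair separation step):

    #include <complex.h>
    #include <math.h>

    #define CH 4    /* 4 complex buffers = 8 real channels packed in pairs */

    /* One set of addressing and twiddle-recursion work drives the butterfly
       for all CH channel buffers; only the innermost body is replicated. */
    void fft_multichannel(double complex *z[CH], int N)
    {
        for (int len = 2; len <= N; len <<= 1) {            /* stage loop     */
            double complex wstep = cexp(-2.0 * I * M_PI / len);
            for (int start = 0; start < N; start += len) {  /* group loop     */
                double complex w = 1.0;
                for (int k = 0; k < len / 2; k++) {         /* butterfly loop */
                    int i = start + k, j = i + len / 2;
                    for (int c = 0; c < CH; c++) {      /* replicated part:   */
                        double complex t = w * z[c][j]; /* same indices and   */
                        z[c][j] = z[c][i] - t;          /* twiddle reused for */
                        z[c][i] += t;                   /* every channel      */
                    }
                    w *= wstep;                             /* twiddle recursion */
                }
            }
        }
    }

In practice the innermost channel loop would be unrolled by hand - Kevin's 'repeat the butterfly code 4 times' - so that the addressing and the twiddle recursion are still done only once per butterfly position.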
Thanks, Kevin.  You've certainly given me some ideas and specific
algorithms to try out.  It will take me a while to digest, and then I have
to fit the optimization phase of my development into my schedule.  Until
then, you have my thanks in advance.
On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
> ...
>
> Of course, using global twiddle tables or multi-threading changes
> things, and it could tilt the decision as to what approach is more
> efficient substantially.
>
> So yes, one or the other approach could be more or less efficient than
> the other, depending on what processing resources are available.
> ...
> Kevin McGee
The point of using a dedicated DSP chip with an RTOS instead of a Windows OS on an Intel CPU is to be able to put all the table setup into the initialization and avoid any recurring setup costs during real-time operation. The OP seems to have a DSP chip. What are the timings like for an efficient real-time implementation without the setup overhead?

Dale B. Dalrymple
On Jan 30, 12:32 pm, dbd <d...@ieee.org> wrote:
> On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
>
> > ...
> >
> > Of course, using global twiddle tables or multi-threading changes
> > things, and it could tilt the decision as to what approach is more
> > efficient substantially.
> >
> > So yes, one or the other approach could be more or less efficient than
> > the other, depending on what processing resources are available.
> > ...
> > Kevin McGee
>
> The point of using a dedicated DSP chip with an RTOS instead of a
> Windows OS on an Intel CPU is to be able to put all the table setup
> into the initialization and avoid any recurring setup costs during
> real-time operation. The OP seems to have a DSP chip. What are the
> timings like for an efficient real-time implementation without the
> setup overhead?
>
> Dale B. Dalrymple
Well, I used to have a TMS320 evaluation board, but that was ten years ago.

As far as efficiency, even if you strip out all the overhead, you still have to do a certain number of calculations for a particular algorithm. And you know how many there are beforehand. So you have a fairly good idea of what the minimum is for doing the calculations. Anything beyond that is overhead, and some of it is necessary.

As for me, I usually stick with radix 8 to minimize multiplies, and use look-up tables for twiddle factors and bit reverse. I can do other things to be more efficient, but it usually isn't worth it to improve something by 5 or 10% if it takes twice (or more) as long to write and test it.

As far as the 'Two N real with an N complex' method - I think that, even with an RTOS, it would be more efficient to use it instead of doing two separate N point real transforms. The problem with the latter is that any address overhead associated with the chosen FFT algorithm will have to be repeated for each real FFT. However, with an RTOS (less overhead), the cross-over point where the 'Two N real with an N complex' becomes more efficient might be at a larger value of N.

Kevin McGee
On Jan 31, 2:50 am, kevin <kevinjmc...@netscape.net> wrote:
> As far as efficiency, even if you strip out all the overhead, you
> still have to do a certain number of calculations for a particular
> algorithm. And you know how many there are beforehand. So you have a
> fairly good idea of what the minimum is for doing the calculations.
> Anything beyond that is overhead, and some of it is necessary.
i agree with that.
> As for me, I usually stick with radix 8 to minimize multiplies, and
> use look-up tables for twiddle factors and bit reverse.
Kevin, i am curious as to how multiplies are reduced, but i *do* understand how bumping it up from radix-2 to radix-4 will save you from reloading the real and imag parts of the twiddle unnecessarily. this is because

   e^(j*(theta + pi/2)) = j*e^(j*theta)

so the real and imaginary parts are simply swapped (in their usage, you need not move the actual numbers in registers) and some additions are turned into subtractions. also, if this is in fixed point and you want to scale it down by sqrt(2) for each FFT pass, since this is 2 passes, it's scaled down by 2 and it's a simple ASR operation on the results.

i do not see how bumping it from radix-4 to radix-8 will help further, other than if there is no limit to the number of machine registers, the cost of moving the FFT buffer data into and outa registers might be slightly reduced.

this is not exactly radix-8, but i *can* imagine in a single pass, having 4 butterflies with twiddles of

   e^(j*(theta + 0))
   e^(j*(theta + pi/4))
   e^(j*(pi/2 - theta))
   e^(j*(pi/4 - theta))

for 0 < theta < pi/8, which would process 8 points. but it's not simply a radix-8 thing. it's two radix-4 passes moving in opposite directions. there would be two special cases where theta = 0 and theta = pi/8 where code yanked outside of the loop would run separately (and efficiently). otherwise, i dunno what good radix-8 does over radix-4.
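A tiny numeric check of that quarter-turn identity (an editor's sketch; a forward FFT with the e^(-j...) convention picks up a factor of -j instead, but the real/imag swap is the same):

    #include <assert.h>
    #include <complex.h>
    #include <math.h>

    /* e^(j*(theta + pi/2)) = j * e^(j*theta): advancing a twiddle angle by a
       quarter turn only swaps its real and imaginary parts (with a sign flip),
       so the stored cos/sin pair can be reused instead of reloaded. */
    int main(void)
    {
        double theta = 0.3;                           /* any angle      */
        double complex w  = cexp(I * theta);
        double complex wq = cexp(I * (theta + M_PI / 2.0));
        assert(cabs(wq - I * w) < 1e-12);             /* wq = j*w       */
        assert(fabs(creal(wq) + cimag(w)) < 1e-12);   /* re(wq) = -im(w) */
        assert(fabs(cimag(wq) - creal(w)) < 1e-12);   /* im(wq) =  re(w) */
        return 0;
    }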
> I can do
> other things to be more efficient, but it usually isn't worth it to
> improve something by 5 or 10% if it takes twice (or more) as long to
> write and test it.
avoiding unnecessary multiplication by 1 or j seems to be worth it. it means (in the radix-4 case, anyway) that the very first butterfly of the group is yanked out of the DO loop and handled differently. that shouldn't be too hard to write and test.
> As far as the 'Two N real with an N complex' method - I think that,
> even with an RTOS, it would be more efficient to use it instead of
> doing two separate N point real transforms.
i think that's pretty clear. if you are doing fast convolution, you can filter two separate *real* streams (or two overlapping blocks of the same stream) by stuffing one into the real part and the other into the imaginary part. do your overlap-add or overlap-save thing and what comes out in the real and imag parts are your two output streams (or two output blocks).

r b-j
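A minimal sketch of that trick for a single block (an editor's illustration, not r b-j's code; cfft/icfft are hypothetical forward/inverse complex FFTs with the 1/N scaling in the inverse, H is the transform of a real impulse response, and the overlap-add/overlap-save bookkeeping is left out):

    #include <complex.h>

    extern void cfft(double complex *z, int n);    /* hypothetical forward FFT */
    extern void icfft(double complex *z, int n);   /* hypothetical inverse FFT */

    /* Filter two real blocks x1 and x2 with the same real filter (spectrum H)
       using a single complex FFT/IFFT pair.  Because the filter is real, the
       real and imaginary parts of the result are the two filtered blocks. */
    void fast_convolve_two_real(const double *x1, const double *x2,
                                const double complex *H,
                                double *y1, double *y2,
                                double complex *z, int N)
    {
        for (int n = 0; n < N; n++)
            z[n] = x1[n] + I * x2[n];              /* pack the two real blocks */

        cfft(z, N);
        for (int k = 0; k < N; k++)
            z[k] *= H[k];                          /* pointwise multiply by filter */
        icfft(z, N);

        for (int n = 0; n < N; n++) {
            y1[n] = creal(z[n]);                   /* output block for stream 1 */
            y2[n] = cimag(z[n]);                   /* output block for stream 2 */
        }
    }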
On Jan 30, 11:50 pm, kevin <kevinjmc...@netscape.net> wrote:
> ..
>
> Well, I used to have a TMS320 evaluation board, but that was ten years
> ago.
So are you relating workstation benchmarks or benchmarks from a DSP architecture?
> ...
> As for me, I usually stick with radix 8 to minimize multiplies, and
> use look-up tables for twiddle factors and bit reverse. I can do
> other things to be more efficient, but it usually isn't worth it to
> improve something by 5 or 10% if it takes twice (or more) as long to
> write and test it.
That's a common choice for people implementing benchmarks at their desk on a workstation. People implementing high volume embedded systems look at that tradeoff differently.

Looking at the DSP55x architecture, the right answer (well, efficient implementation) for the OP probably relates more to DSP architectural features than workstation benchmarks. The DSP55x chips have a hardware FFT accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed of the directly programmed CPU on the chip, so any scheme that leads to the use of larger transforms than supported by the HWAFFT is likely to be less efficient.

On the other hand, if the OP actually has a homework problem to pack multiple channels of FFT into a single FFT call, we haven't helped the OP (or Rick) much.

Dale B. Dalrymple