Reply by dbd February 1, 2011
On Jan 31, 11:48 pm, kevin <kevinjmc...@netscape.net> wrote:
> On Jan 31, 2:58 pm, dbd wrote:
>
> > So are you relating workstation benchmarks or benchmarks from a DSP
> > architecture?
>
> Neither.  I'm simply saying that if there are two algorithms that can
> be implemented in different ways, then it makes sense to use the more
> 'efficient' one ('efficiency' meaning different things, depending on
> the design constraints).
> ...
You posted:

On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
#> I've timed both approaches and have found the single complex FFT plus
#> separation overhead to be faster than the two real approach,
#> especially for large N.
#> ...
#> I've also found similar results for the '2N real with an N
#> complex' (eg: 1024 real points with a 512 point complex FFT plus pre
#> and post processing).  But the cross-over point where doing things
#> with the one complex FFT becomes faster is larger due to the
#> additional processing that this approach requires (I've substantially
#> reduced that processing by using a modified method, but it's not
#> included in my tutorial).
#> ...

I was asking to learn about the context of the timings you mentioned.
> >> As for me, I usually stick with radix 8 to minimize multiplies, and
> >> use look-up tables for twiddle factors and bit reverse.  I can do
> >> other things to be more efficient, but it usually isn't worth it to
> >> improve something by 5 or 10% if it takes twice (or more) as long to
> >> write and test it.
>
> (Dale wrote):
>
> > That's a common choice for people implementing benchmarks at their
> > desk on a workstation.  People implementing high volume embedded
> > systems look at that tradeoff differently.
> ...
>
> I'm not sure that 'embedded' or 'high volume' or any other adjective
> guarantees an efficient result.
> ...
By "common choice ... on a workstation" I was referring to "it isn't worth it to improve something by 5 or 10%". Embedded and high volume developers might go for the 5 or 10% targets. They might have to. Dale B. Dalrymple
Reply by kevin February 1, 2011
On Jan 31, 2:58 pm, dbd wrote:

> So are you relating workstation benchmarks or benchmarks from a DSP
> architecture?
Neither.  I'm simply saying that if there are two algorithms that can
be implemented in different ways, then it makes sense to use the more
'efficient' one ('efficiency' meaning different things, depending on
the design constraints).

Consider the '2N real with an N complex' technique.  It's been around
since 1967, but there are different implementations.  One of them is:

http://focus.ti.com.cn/cn/lit/an/spra291/spra291.pdf

I wouldn't doubt that quite a few people have made use of it, because
it's a canned routine from a major vendor.  Many might consider it
'efficient'.

Now I don't believe that I've ever met the author of the above, but
I'm sure he's a very smart man, and I appreciate his effort and
expertise.  But I have a different version of the above that uses 1/2
the multiplies, slightly fewer adds, and 1/8 the size of the look-up
tables (and I presume that the N complex FFT would be the same for
both algorithms).

(kevin wrote):
>> As for me, I usually stick with radix 8 to minimize multiplies, and
>> use look-up tables for twiddle factors and bit reverse.  I can do
>> other things to be more efficient, but it usually isn't worth it to
>> improve something by 5 or 10% if it takes twice (or more) as long to
>> write and test it.
(Dale wrote):
> That's a common choice for people implementing benchmarks at their
> desk on a workstation.  People implementing high volume embedded
> systems look at that tradeoff differently.
I programmed my new version of the '2N real w/ N complex' on a desktop while working on a hardware design. And I sketched it out with paper and pencil first. I'm not sure that 'embedded' or 'high volume' or any other adjective guarantees an efficient result.
> Looking at the DSP55x architecture, the right answer (well, efficient
> implementation) for the OP probably relates more to DSP architectural
> features than workstation benchmarks.  The DSP55x chips have a hardware
> fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
> of the directly programmed CPU on the chip, so any scheme that leads
> to the use of larger transforms than supported by the HWAFFT is likely
> to be less efficient.
But you can slice and dice an FFT algorithm to fit the size of the
available resources.  As usual, the resources constrain the design
space and influence design decisions.  No matter if it's custom
hardware, gate array, FPGA, DSP, ... microcontroller, abacus, etc.,
I'd rather start with an efficient algorithm that fits within the
design constraints.  And it can also happen that resources are so
limited that one cannot use a more 'efficient' algorithm (eg: a
look-up table approach for a microcontroller with very limited
memory).
> On the other hand, if the OP actually has a homework problem to pack
> multiple channels of FFT into a single FFT call, we haven't helped the
> OP (or Rick) much.
Well, I thought the OP could use the 'replicate the inner butterfly
code' and 'two N real with an N complex' ideas.  It's worked for me
with image processing.  But it can be tricky to do.

Kevin McGee
Reply by rsdio January 31, 2011
On Jan 31, 2:58 pm, dbd wrote:
> Looking at the DSP55x architecture, the right answer (well, efficient
> implementation) for the OP probably relates more to DSP architectural
> features than workstation benchmarks.  The DSP55x chips have a hardware
> fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
> of the directly programmed CPU on the chip, so any scheme that leads
> to the use of larger transforms than supported by the HWAFFT is likely
> to be less efficient.
>
> Dale B. Dalrymple
Only v3 of the DSP C55x has hardware fft acceleration.  I'm working
with v2, which does not have hardware acceleration.  But it can
perform two parallel multiplications in a single cycle, meaning that
the usual concerns of workstation benchmarks are not relevant.  In
other words, multiplication is no more expensive than addition.
Another factor is that bit-reversed addressing is supported by the
processor, so it actually has zero cost.

Brian Willoughby
Sound Consulting
Reply by dbd January 31, 2011
On Jan 30, 11:50 pm, kevin <kevinjmc...@netscape.net> wrote:
> ..
>
> Well, I used to have a TMS320 evaluation board, but that was ten years
> ago.
So are you relating workstation benchmarks or benchmarks from a DSP architecture?
> ...
> As for me, I usually stick with radix 8 to minimize multiplies, and
> use look-up tables for twiddle factors and bit reverse.  I can do
> other things to be more efficient, but it usually isn't worth it to
> improve something by 5 or 10% if it takes twice (or more) as long to
> write and test it.
That's a common choice for people implementing benchmarks at their
desk on a workstation.  People implementing high volume embedded
systems look at that tradeoff differently.

Looking at the DSP55x architecture, the right answer (well, efficient
implementation) for the OP probably relates more to DSP architectural
features than workstation benchmarks.  The DSP55x chips have a hardware
fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
of the directly programmed CPU on the chip, so any scheme that leads
to the use of larger transforms than supported by the HWAFFT is likely
to be less efficient.

On the other hand, if the OP actually has a homework problem to pack
multiple channels of FFT into a single FFT call, we haven't helped the
OP (or Rick) much.

Dale B. Dalrymple
Reply by robert bristow-johnson January 31, 2011
On Jan 31, 2:50 am, kevin <kevinjmc...@netscape.net> wrote:
>
> As far as efficiency, even if you strip out all the overhead, you
> still have to do a certain number of calculations for a particular
> algorithm.  And you know how many there are beforehand.  So you have a
> fairly good idea of what the minimum is for doing the calculations.
> Anything beyond that is overhead, and some of it is necessary.
i agree with that.
> As for me, I usually stick with radix 8 to minimize multiplies, and
> use look-up tables for twiddle factors and bit reverse.
Kevin, i am curious as to how multiplies are reduced, but i *do*
understand how bumping it up from radix-2 to radix-4 will save you
from reloading the real and imag parts of the twiddle unnecessarily.
this is because

   e^(j*(theta + pi/2)) = j*e^(j*theta)

so the real and imaginary parts are simply swapped (in their usage,
you need not move the actual numbers in registers) and some additions
are turned into subtractions.  also, if this is in fixed point and you
want to scale it down by sqrt(2) for each FFT pass, since this is 2
passes, it's scaled down by 2 and it's a simple ASR operation on the
results.

i do not see how bumping it from radix-4 to radix-8 will help further,
other than if there is no limit to the number of machine registers,
the cost of moving the FFT buffer data into and outa registers might
be slightly reduced.

this is not exactly radix-8, but i *can* imagine in a single pass,
having 4 butterflies with twiddles of

   e^(j*(theta + 0))
   e^(j*(theta + pi/4))
   e^(j*(pi/2 - theta))
   e^(j*(pi/4 - theta))

for 0 < theta < pi/8, which would process 8 points.  but it's not
simply a radix-8 thing.  it's two radix-4 passes moving in opposite
directions.  there would be two special cases where theta = 0 and
theta = pi/8 where code yanked outside of the loop would run
separately (and efficiently).  otherwise, i dunno what good radix-8
does over radix-4.
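[For concreteness, a minimal C sketch of the swap described above;
names are illustrative and not from any particular FFT library.  Once
x*W is formed, x*(j*W) needs no new multiplies and no reloading of
cos/sin:]

    #include <math.h>

    /* Twiddle reuse: e^(j*(theta + pi/2)) = j*e^(j*theta), so x*(j*W)
       is x*W with real/imag swapped and one sign flipped. */
    static void twiddle_pair(float xr, float xi, float theta,
                             float *w_re, float *w_im,    /* x * W     */
                             float *jw_re, float *jw_im)  /* x * (j*W) */
    {
        float c = cosf(theta), s = sinf(theta);  /* W = c + j*s, loaded once */
        *w_re  = xr * c - xi * s;
        *w_im  = xr * s + xi * c;
        *jw_re = -*w_im;   /* Re(j*(x*W)) = -Im(x*W) */
        *jw_im =  *w_re;   /* Im(j*(x*W)) =  Re(x*W) */
    }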
> I can do
> other things to be more efficient, but it usually isn't worth it to
> improve something by 5 or 10% if it takes twice (or more) as long to
> write and test it.
avoiding unnecessary multiplication by 1 or j seems to be worth it. it means (in the radix-4 case, anyway) that the very first butterfly of the group is yanked out of the DO loop and handled differently. that shouldn't be too hard to write and test.
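[A sketch of that hoisting for a plain radix-2 DIT pass, assuming a
precomputed per-pass twiddle table wr/wi with wr[k] = cos(-pi*k/half),
wi[k] = sin(-pi*k/half); the radix-4 W = -j butterfly would be pulled
out the same way.  Illustrative code, not a production FFT:]

    /* One radix-2 DIT pass over n points with butterfly span `half`.
       The k = 0 (W = 1) butterfly is hoisted out of the inner loop,
       so it costs no multiplies at all. */
    static void dit_pass(float *xr, float *xi, int n, int half,
                         const float *wr, const float *wi)
    {
        for (int b = 0; b < n; b += 2 * half) {
            float tr = xr[b + half], ti = xi[b + half];  /* k = 0: W = 1 */
            xr[b + half] = xr[b] - tr;
            xi[b + half] = xi[b] - ti;
            xr[b] += tr;
            xi[b] += ti;
            for (int k = 1; k < half; k++) {             /* general twiddles */
                int i = b + k, j = i + half;
                float ur = xr[j] * wr[k] - xi[j] * wi[k];  /* x[j] * W^k */
                float ui = xr[j] * wi[k] + xi[j] * wr[k];
                xr[j] = xr[i] - ur;  xi[j] = xi[i] - ui;
                xr[i] += ur;         xi[i] += ui;
            }
        }
    }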
> As far as the 'Two N real with an N complex' method - I think that,
> even with an RTOS, it would be more efficient to use it instead of
> doing two separate N point real transforms.
i think that's pretty clear.

if you are doing fast convolution, you can filter two separate *real*
streams (or two overlapping blocks of the same stream) by stuffing one
into the real part and the other into the imaginary part.  do your
overlap-add or overlap-save thing and what comes out in the real and
imag parts are your two output streams (or two output blocks).

r b-j
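[For reference, the post-FFT separation that this packing trick relies
on, sketched with C99 <complex.h>.  Z is the N point complex FFT of
z[n] = a[n] + j*b[n], produced by whatever FFT routine is at hand:]

    #include <complex.h>

    /* Separate the spectra of two real sequences a[] and b[] packed as
       z[n] = a[n] + j*b[n] before a single complex FFT, using
       conjugate symmetry:
         A[k] =      (Z[k] + conj(Z[N-k])) / 2
         B[k] = -j * (Z[k] - conj(Z[N-k])) / 2                       */
    static void split_spectra(const float complex *Z, int N,
                              float complex *A, float complex *B)
    {
        for (int k = 0; k < N; k++) {
            float complex Zc = conjf(Z[(N - k) % N]);  /* wrap k = 0 */
            A[k] = 0.5f * (Z[k] + Zc);
            B[k] = -0.5f * I * (Z[k] - Zc);
        }
    }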
Reply by kevin January 31, 2011
On Jan 30, 12:32 pm, dbd <d...@ieee.org> wrote:
> On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
>
> > ...
> >
> > Of course, using global twiddle tables or multi-threading changes
> > things, and it could tilt the decision as to what approach is more
> > efficient substantially.
> >
> > So yes, one or the other approach could be more or less efficient than
> > the other, depending on what processing resources are available.
> > ...
> > Kevin McGee
>
> The point of using a dedicated DSP chip with an RTOS instead of a
> Windows OS on an Intel CPU is to be able to put all the table setup
> into the initialization and avoid any recurring setup costs during
> real-time operation.  The OP seems to have a DSP chip.  What are the
> timings like for an efficient real-time implementation without the
> setup overhead?
>
> Dale B. Dalrymple
Well, I used to have a TMS320 evaluation board, but that was ten years
ago.

As far as efficiency, even if you strip out all the overhead, you
still have to do a certain number of calculations for a particular
algorithm.  And you know how many there are beforehand.  So you have a
fairly good idea of what the minimum is for doing the calculations.
Anything beyond that is overhead, and some of it is necessary.

As for me, I usually stick with radix 8 to minimize multiplies, and
use look-up tables for twiddle factors and bit reverse.  I can do
other things to be more efficient, but it usually isn't worth it to
improve something by 5 or 10% if it takes twice (or more) as long to
write and test it.

As far as the 'Two N real with an N complex' method - I think that,
even with an RTOS, it would be more efficient to use it instead of
doing two separate N point real transforms.  The problem with the
latter is that any address overhead associated with the chosen FFT
algorithm will have to be repeated for each real FFT.  However, with
an RTOS (less overhead), the cross-over point where the 'Two N real
with an N complex' becomes more efficient might be at a larger value
of N.

Kevin McGee
Reply by dbd January 30, 2011
On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
> ...
>
> Of course, using global twiddle tables or multi-threading changes
> things, and it could tilt the decision as to what approach is more
> efficient substantially.
>
> So yes, one or the other approach could be more or less efficient than
> the other, depending on what processing resources are available.
> ...
> Kevin McGee
The point of using a dedicated DSP chip with an RTOS instead of a
Windows OS on an Intel CPU is to be able to put all the table setup
into the initialization and avoid any recurring setup costs during
real-time operation.  The OP seems to have a DSP chip.  What are the
timings like for an efficient real-time implementation without the
setup overhead?

Dale B. Dalrymple
Reply by rsdio January 30, 2011
Thanks, Kevin.  You've certainly given me some ideas and specific
algorithms to try out.  It will take me a while to digest, and then I have
to fit the optimization phase of my development into my schedule.  Until
then, you have my thanks in advance.
Reply by kevin January 30, 2011
On Jan 28, 7:56 pm, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:

> In my specific application, I have a hunch that it would not be possible to
> calculate multiple transforms at the same time.  I'm guessing that since
> there isn't a readily available radix-8 butterfly, then the processor
> probably has too few registers to calculate multiple transforms in SIMD
> fashion.  I could be way off, though, which is why I am bringing this up on
> comp.dsp
Radix has little to do with it - higher radix has to do with the lower
number of complex multiplies in the FFT.

As for your other concerns, let me clarify that there are really two
distinct techniques:

1) Two different N point real sequences: (xr[N], xi[N] = 0) and
(yr[N], yi[N] = 0) transformed simultaneously with a complex in/out N
point FFT.  A small amount of post-processing is required to separate
them.

2) One 2N point real sequence (eg: xr[128], xi[128] = 0) transformed
using one 64 point complex in/out FFT plus pre-processing (ie: odd/
even split) and post-processing (ie: separate even/odd transforms;
recombine to get 2N).

For your purposes (multiple channels of real data), the first
technique is the one to use.

As for processing more than 2 channels (say, 4 or 6 or 10), I suppose
that it might be possible in a single, larger FFT, but I have serious
doubts as to whether that would be an efficient way of doing things.
In fact, it could be grossly inefficient.

When I referred to doing more channels in parallel, this is what I
meant.  To keep things simple, let's say that there's no large memory
space (so we'll have to use a recursion for our twiddle calculations),
and also no multi-threading, etc.  A typical recursive FFT might look
like the 3 loop structure below:

    do ( some addressing )
      do ( more addressing )
        do ( butterfly calculations )
          ......
        }
        twiddle recursion stuff
      }
      more twiddle recursion stuff
    }

To make the software parallel so that it can operate on multiple
channels, why not replicate the butterfly calculation part as many
times as you need?  You could still use the 'two N point real with an
N point complex' method.

In other words, for processing 8 channels of real data, make channels
1 and 2 the real and imaginary parts of a complex input.  So too with
real channels 3 and 4, etc.  Then you could program an N point complex
in/out FFT, and repeat the butterfly code 4 times.  After the FFT,
you'd have 4 complex results - the first one contains channels 1 and 2
(which will be separated by post-processing).  And so on.

The important thing is that you're not doing the addressing and
twiddle recursion overhead for each channel.  You do it only once.
I've done this kind of thing before for image processing, and it
improves things.  But it might be difficult to do if you have to work
with existing code that someone else wrote.

Kevin McGee
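[A C sketch of that shared-bookkeeping structure, with illustrative
names.  For simplicity it uses a twiddle table rather than the
recursion kevin mentions, and a channel loop rather than literally
replicated butterfly code; a real DSP version would unroll that loop:]

    /* Multichannel radix-2 DIF FFT sketch: one set of stage/group/
       twiddle bookkeeping shared by all nch complex channels (2*nch
       packed real channels).  wr/wi hold cos/sin of -2*pi*m/n for
       m = 0..n/2-1.  Outputs are left in bit-reversed order. */
    static void fft_multichannel(float *re[], float *im[], int nch,
                                 int n, const float *wr, const float *wi)
    {
        for (int half = n / 2; half >= 1; half /= 2) {      /* stages */
            int tstep = (n / 2) / half;                     /* twiddle stride */
            for (int b = 0; b < n; b += 2 * half) {         /* groups */
                for (int k = 0; k < half; k++) {            /* butterflies */
                    int i = b + k, j = i + half;
                    float c = wr[k * tstep], s = wi[k * tstep]; /* fetched once */
                    for (int ch = 0; ch < nch; ch++) {      /* replicate body */
                        float tr = re[ch][i] - re[ch][j];
                        float ti = im[ch][i] - im[ch][j];
                        re[ch][i] += re[ch][j];
                        im[ch][i] += im[ch][j];
                        re[ch][j] = tr * c - ti * s;        /* (xi - xj) * W */
                        im[ch][j] = tr * s + ti * c;
                    }
                }
            }
        }
    }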
Reply by kevin January 30, 2011
On Jan 29, 1:53 pm, dbd <d...@ieee.org> wrote:

> As Kevin's tutorial points out, the common tricks are less efficient
> than performing a 2N point real input transform or two N point real
> input transforms.  The tricks are efficient only when the only fft
> implementations you have are complex and your data is real.
Actually, Dale, transforming two separate N point real sequences with
a single complex in/out N point FFT could be faster than doing two
separate N point real FFTs, especially for larger N.

As mentioned in the tutorial, the two real FFTs for N points have to
run faster than the one complex N point FFT plus post-processing to
separate the two transforms.  Although an N point FFT for real inputs
can be faster than an N point FFT for complex inputs, there's always
set-up and addressing overhead involved, too.  Plus, for efficient
processing, one would use look-up tables or twiddle recursions instead
of sin/cos library calls.  So there are twiddle tables or recursions
or bit-reverse tables, etc. that one has to deal with.

I've timed both approaches and have found the single complex FFT plus
separation overhead to be faster than the two real approach,
especially for large N.

I've also found similar results for the '2N real with an N complex'
(eg: 1024 real points with a 512 point complex FFT plus pre and post
processing).  But the cross-over point where doing things with the one
complex FFT becomes faster is larger due to the additional processing
that this approach requires (I've substantially reduced that
processing by using a modified method, but it's not included in my
tutorial).

Of course, using global twiddle tables or multi-threading changes
things, and it could tilt the decision as to what approach is more
efficient substantially.

So yes, one or the other approach could be more or less efficient than
the other, depending on what processing resources are available.

You also mentioned:
> The M channels -can- be assembled and disassembled to allow the use of
> an M*N point transform to achieve M transforms of N points.  The
> assembly and disassembly are not free and the M*N point transform is
> less efficient so it isn't commonly done.
Strongly agree.  The assembly/disassembly is definitely not free.  And
I doubt if the extra M FFT stages are really necessary (versus, say, M
parallel N point implementations).

Kevin McGee
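[For reference, the textbook post-processing for the '2N real with an
N complex' method discussed in this thread - the standard spra291-style
recombination, not kevin's reduced-multiply variant, which isn't shown
here.  A sketch assuming C99 <complex.h>, with Z the N point complex
FFT of the packed sequence z[n] = x[2n] + j*x[2n+1]:]

    #include <complex.h>

    /* Recombine the N point complex FFT of the packed even/odd samples
       into bins 0..N of the 2N point real-input transform.  Fe/Fo are
       the spectra of the even and odd samples; the remaining bins
       follow from conjugate symmetry. */
    static void recombine(const float complex *Z, int N, float complex *X)
    {
        const float pi = 3.14159265358979f;
        for (int k = 0; k <= N; k++) {
            int km = k % N;                            /* bin N wraps to 0 */
            float complex Zc = conjf(Z[(N - km) % N]);
            float complex Fe = 0.5f * (Z[km] + Zc);    /* even samples */
            float complex Fo = -0.5f * I * (Z[km] - Zc); /* odd samples */
            X[k] = Fe + cexpf(-I * pi * (float)k / (float)N) * Fo;
        }
    }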