Reply by dbd February 1, 2011
On Jan 31, 11:48 pm, kevin <kevinjmc...@netscape.net> wrote:
> On Jan 31, 2:58 pm, dbd wrote:
>
> > So are you relating workstation benchmarks or benchmarks from a DSP
> > architecture?
>
> Neither.  I'm simply saying that if there are two algorithms that can
> be implemented in different ways, then it makes sense to use the more
> 'efficient' one ('efficiency' meaning different things, depending on
> the design constraints).
> ...
You posted:

On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
#> I've timed both approaches and have found the single complex FFT plus
#> separation overhead to be faster than the two real approach,
#> especially for large N.
#> ...
#> I've also found similar results for the '2N real with an N
#> complex' (eg: 1024 real points with a 512 point complex FFT plus pre
#> and post processing).  But the cross-over point where doing things
#> with the one complex FFT becomes faster is larger due to the
#> additional processing that this approach requires (I've substantially
#> reduced that processing by using a modified method, but it's not
#> included in my tutorial).
#> ...

I was asking to learn about the context of the timings you mentioned.
> >> As for me, I usually stick with radix 8 to minimize multiplies, and
> >> use look-up tables for twiddle factors and bit reverse.  I can do
> >> other things to be more efficient, but it usually isn't worth it to
> >> improve something by 5 or 10% if it takes twice (or more) as long to
> >> write and test it.
>
> (Dale wrote):
>
> > That's a common choice for people implementing benchmarks at their
> > desk on a workstation.  People implementing high volume embedded
> > systems look at that tradeoff differently.
> ...
>
> I'm not sure that 'embedded' or 'high volume' or any other adjective
> guarantees an efficient result.
> ...
By "common choice ... on a workstation" I was referring to "it isn't worth it to improve something by 5 or 10%". Embedded and high volume developers might go for the 5 or 10% targets. They might have to. Dale B. Dalrymple
Reply by kevin February 1, 2011
On Jan 31, 2:58 pm, dbd wrote:

> So are you relating workstation benchmarks or benchmarks from a DSP
> architecture?
Neither.  I'm simply saying that if there are two algorithms that can
be implemented in different ways, then it makes sense to use the more
'efficient' one ('efficiency' meaning different things, depending on
the design constraints).

Consider the '2N real with an N complex' technique.  It's been around
since 1967, but there are different implementations.  One of them is:

http://focus.ti.com.cn/cn/lit/an/spra291/spra291.pdf

I wouldn't doubt that quite a few people have made use of it, because
it's a canned routine from a major vendor.  Many might consider it
'efficient'.

Now I don't believe that I've ever met the author of the above, but
I'm sure he's a very smart man, and I appreciate his effort and
expertise.  But I have a different version of the above that uses 1/2
the multiplies, slightly fewer adds, and 1/8 the size of the look-up
tables (and I presume that the N complex FFT would be the same for
both algorithms).

(kevin wrote):
>> As for me, I usually stick with radix 8 to minimize multiplies, and
>> use look-up tables for twiddle factors and bit reverse.  I can do
>> other things to be more efficient, but it usually isn't worth it to
>> improve something by 5 or 10% if it takes twice (or more) as long to
>> write and test it.
(Dale wrote):
> That's a common choice for people implementing benchmarks at their
> desk on a workstation.  People implementing high volume embedded
> systems look at that tradeoff differently.
I programmed my new version of the '2N real w/ N complex' on a desktop while working on a hardware design. And I sketched it out with paper and pencil first. I'm not sure that 'embedded' or 'high volume' or any other adjective guarantees an efficient result.
> Looking at the DSP55x architecture, the right answer (well, efficient
> implementation) for the OP probably relates more to DSP architectural
> features than workstation benchmarks.  The DSP55x chips have a hardware
> fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
> of the directly programmed CPU on the chip, so any scheme that leads
> to the use of larger transforms than supported by the HWAFFT is likely
> to be less efficient.
But you can slice and dice an FFT algorithm to fit the size of the
available resources.  As usual, the resources constrain the design
space and influence design decisions.  No matter if it's custom
hardware, gate array, FPGA, DSP, ... microcontroller, abacus, etc.,
I'd rather start with an efficient algorithm that fits within the
design constraints.  And it can also happen that resources are so
limited that one cannot use a more 'efficient' algorithm (eg: a
look-up table approach for a microcontroller with very limited
memory).
> On the other hand, if the OP actually has a homework problem to pack
> multiple channels of FFT into a single FFT call, we haven't helped the
> OP (or Rick) much.
Well, I thought the OP could use the 'replicate the inner butterfly
code' and 'two N real with an N complex' ideas.  It's worked for me
with image processing.  But it can be tricky to do.

Kevin McGee
Reply by rsdio January 31, 2011
On Jan 31, 2:58 pm, dbd wrote:
> Looking at the DSP55x architecture, the right answer (well, efficient
> implementation) for the OP probably relates more to DSP architectural
> features than workstation benchmarks.  The DSP55x chips have a hardware
> fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
> of the directly programmed CPU on the chip, so any scheme that leads
> to the use of larger transforms than supported by the HWAFFT is likely
> to be less efficient.
>
> Dale B. Dalrymple
Only v3 of the DSP C55x has hardware fft acceleration.  I'm working
with v2, which does not have hardware acceleration.  But it can
perform two parallel multiplications in a single cycle, meaning that
the usual concerns of workstation benchmarks are not relevant.  In
other words, multiplication is no more expensive than addition.
Another factor is that bit-reversed addressing is supported by the
processor, so it actually has zero cost.

Brian Willoughby
Sound Consulting
Reply by dbd January 31, 2011
On Jan 30, 11:50 pm, kevin <kevinjmc...@netscape.net> wrote:
> ..
>
> Well, I used to have a TMS320 evaluation board, but that was ten years
> ago.
So are you relating workstation benchmarks or benchmarks from a DSP architecture?
> ...
> As for me, I usually stick with radix 8 to minimize multiplies, and
> use look-up tables for twiddle factors and bit reverse.  I can do
> other things to be more efficient, but it usually isn't worth it to
> improve something by 5 or 10% if it takes twice (or more) as long to
> write and test it.
That's a common choice for people implementing benchmarks at their
desk on a workstation.  People implementing high volume embedded
systems look at that tradeoff differently.

Looking at the DSP55x architecture, the right answer (well, efficient
implementation) for the OP probably relates more to DSP architectural
features than workstation benchmarks.  The DSP55x chips have a hardware
fft accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed
of the directly programmed CPU on the chip, so any scheme that leads
to the use of larger transforms than supported by the HWAFFT is likely
to be less efficient.

On the other hand, if the OP actually has a homework problem to pack
multiple channels of FFT into a single FFT call, we haven't helped the
OP (or Rick) much.

Dale B. Dalrymple
Reply by robert bristow-johnson January 31, 2011
On Jan 31, 2:50 am, kevin <kevinjmc...@netscape.net> wrote:
>
> As far as efficiency, even if you strip out all the overhead, you
> still have to do a certain number of calculations for a particular
> algorithm.  And you know how many there are beforehand.  So you have a
> fairly good idea of what the minimum is for doing the calculations.
> Anything beyond that is overhead, and some of it is necessary.
i agree with that.
> As for me, I usually stick with radix 8 to minimize multiplies, and
> use look-up tables for twiddle factors and bit reverse.
Kevin, i am curious as to how multiplies are reduced, but i *do*
understand how bumping it up from radix-2 to radix-4 will save you
from reloading the real and imag parts of the twiddle unnecessarily.
this is because

   e^(j*(theta + pi/2)) = j*e^(j*theta)

so the real and imaginary parts are simply swapped (in their usage,
you need not move the actual numbers in registers) and some additions
are turned into subtractions.  also, if this is in fixed point and you
want to scale it down by sqrt(2) for each FFT pass, since this is 2
passes, it's scaled down by 2 and it's a simple ASR operation on the
results.

i do not see how bumping it from radix-4 to radix-8 will help further,
other than if there is no limit to the number of machine registers,
the cost of moving the FFT buffer data into and outa registers might
be slightly reduced.

this is not exactly radix-8, but i *can* imagine in a single pass,
having 4 butterflies with twiddles of

   e^(j*(theta + 0))
   e^(j*(theta + pi/4))
   e^(j*(pi/2 - theta))
   e^(j*(pi/4 - theta))

for 0 < theta < pi/8, which would process 8 points.  but it's not
simply a radix-8 thing.  it's two radix-4 passes moving in opposite
directions.  there would be two special cases where theta = 0 and
theta = pi/8 where code yanked outside of the loop would run
separately (and efficiently).  otherwise, i dunno what good radix-8
does over radix-4.
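[For concreteness, a minimal C sketch of the swap described above;
names are illustrative and not from any particular FFT library.  Once
x*W is formed, x*(j*W) needs no new multiplies and no reloading of
cos/sin:]

    #include <math.h>

    /* Twiddle reuse: e^(j*(theta + pi/2)) = j*e^(j*theta), so x*(j*W)
       is x*W with real/imag swapped and one sign flipped. */
    static void twiddle_pair(float xr, float xi, float theta,
                             float *w_re, float *w_im,    /* x * W     */
                             float *jw_re, float *jw_im)  /* x * (j*W) */
    {
        float c = cosf(theta), s = sinf(theta);  /* W = c + j*s, loaded once */
        *w_re  = xr * c - xi * s;
        *w_im  = xr * s + xi * c;
        *jw_re = -*w_im;   /* Re(j*(x*W)) = -Im(x*W) */
        *jw_im =  *w_re;   /* Im(j*(x*W)) =  Re(x*W) */
    }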
> I can do
> other things to be more efficient, but it usually isn't worth it to
> improve something by 5 or 10% if it takes twice (or more) as long to
> write and test it.
avoiding unnecessary multiplication by 1 or j seems to be worth it. it means (in the radix-4 case, anyway) that the very first butterfly of the group is yanked out of the DO loop and handled differently. that shouldn't be too hard to write and test.
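[A sketch of that hoisting for a plain radix-2 DIT pass, assuming a
precomputed per-pass twiddle table wr/wi with wr[k] = cos(-pi*k/half),
wi[k] = sin(-pi*k/half); the radix-4 W = -j butterfly would be pulled
out the same way.  Illustrative code, not a production FFT:]

    /* One radix-2 DIT pass over n points with butterfly span `half`.
       The k = 0 (W = 1) butterfly is hoisted out of the inner loop,
       so it costs no multiplies at all. */
    static void dit_pass(float *xr, float *xi, int n, int half,
                         const float *wr, const float *wi)
    {
        for (int b = 0; b < n; b += 2 * half) {
            float tr = xr[b + half], ti = xi[b + half];  /* k = 0: W = 1 */
            xr[b + half] = xr[b] - tr;
            xi[b + half] = xi[b] - ti;
            xr[b] += tr;
            xi[b] += ti;
            for (int k = 1; k < half; k++) {             /* general twiddles */
                int i = b + k, j = i + half;
                float ur = xr[j] * wr[k] - xi[j] * wi[k];  /* x[j] * W^k */
                float ui = xr[j] * wi[k] + xi[j] * wr[k];
                xr[j] = xr[i] - ur;  xi[j] = xi[i] - ui;
                xr[i] += ur;         xi[i] += ui;
            }
        }
    }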
> As far as the 'Two N real with an N complex' method - I think that,
> even with an RTOS, it would be more efficient to use it instead of
> doing two separate N point real transforms.
i think that's pretty clear.

if you are doing fast convolution, you can filter two separate *real*
streams (or two overlapping blocks of the same stream) by stuffing one
into the real part and the other into the imaginary part.  do your
overlap-add or overlap-save thing and what comes out in the real and
imag parts are your two output streams (or two output blocks).

r b-j
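[For reference, the post-FFT separation that this packing trick relies
on, sketched with C99 <complex.h>.  Z is the N point complex FFT of
z[n] = a[n] + j*b[n], produced by whatever FFT routine is at hand:]

    #include <complex.h>

    /* Separate the spectra of two real sequences a[] and b[] packed as
       z[n] = a[n] + j*b[n] before a single complex FFT, using
       conjugate symmetry:
         A[k] =      (Z[k] + conj(Z[N-k])) / 2
         B[k] = -j * (Z[k] - conj(Z[N-k])) / 2                       */
    static void split_spectra(const float complex *Z, int N,
                              float complex *A, float complex *B)
    {
        for (int k = 0; k < N; k++) {
            float complex Zc = conjf(Z[(N - k) % N]);  /* wrap k = 0 */
            A[k] = 0.5f * (Z[k] + Zc);
            B[k] = -0.5f * I * (Z[k] - Zc);
        }
    }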
Reply by kevin January 31, 2011
On Jan 30, 12:32 pm, dbd <d...@ieee.org> wrote:
> On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
>
> > ...
> >
> > Of course, using global twiddle tables or multi-threading changes
> > things, and it could tilt the decision as to what approach is more
> > efficient substantially.
> >
> > So yes, one or the other approach could be more or less efficient than
> > the other, depending on what processing resources are available.
> > ...
> > Kevin McGee
>
> The point of using a dedicated DSP chip with an RTOS instead of a
> Windows OS on an Intel CPU is to be able to put all the table setup
> into the initialization and avoid any recurring setup costs during
> real-time operation.  The OP seems to have a DSP chip.  What are the
> timings like for an efficient real-time implementation without the
> setup overhead?
>
> Dale B. Dalrymple
Well, I used to have a TMS320 evaluation board, but that was ten years
ago.

As far as efficiency, even if you strip out all the overhead, you
still have to do a certain number of calculations for a particular
algorithm.  And you know how many there are beforehand.  So you have a
fairly good idea of what the minimum is for doing the calculations.
Anything beyond that is overhead, and some of it is necessary.

As for me, I usually stick with radix 8 to minimize multiplies, and
use look-up tables for twiddle factors and bit reverse.  I can do
other things to be more efficient, but it usually isn't worth it to
improve something by 5 or 10% if it takes twice (or more) as long to
write and test it.

As far as the 'Two N real with an N complex' method - I think that,
even with an RTOS, it would be more efficient to use it instead of
doing two separate N point real transforms.  The problem with the
latter is that any address overhead associated with the chosen FFT
algorithm will have to be repeated for each real FFT.  However, with
an RTOS (less overhead), the cross-over point where the 'Two N real
with an N complex' becomes more efficient might be at a larger value
of N.

Kevin McGee
Reply by dbd January 30, 2011
On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
> ...
>
> Of course, using global twiddle tables or multi-threading changes
> things, and it could tilt the decision as to what approach is more
> efficient substantially.
>
> So yes, one or the other approach could be more or less efficient than
> the other, depending on what processing resources are available.
> ...
> Kevin McGee
The point of using a dedicated DSP chip with an RTOS instead of a
Windows OS on an Intel CPU is to be able to put all the table setup
into the initialization and avoid any recurring setup costs during
real-time operation.  The OP seems to have a DSP chip.  What are the
timings like for an efficient real-time implementation without the
setup overhead?

Dale B. Dalrymple
Reply by rsdio January 30, 2011
Thanks, Kevin.  You've certainly given me some ideas and specific
algorithms to try out.  It will take me a while to digest, and then I have
to fit the optimization phase of my development into my schedule.  Until
then, you have my thanks in advance.
Reply by kevin January 30, 2011
On Jan 28, 7:56 pm, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:

> In my specific application, I have a hunch that it would not be possible to
> calculate multiple transforms at the same time.  I'm guessing that since
> there isn't a readily available radix-8 butterfly, then the processor
> probably has too few registers to calculate multiple transforms in SIMD
> fashion.  I could be way off, though, which is why I am bringing this up on
> comp.dsp
Radix has little to do with it - higher radix has to do with the lower
number of complex multiplies in the FFT.

As for your other concerns, let me clarify that there are really two
distinct techniques:

1) Two different N point real sequences: (xr[N], xi[N] = 0) and
(yr[N], yi[N] = 0) transformed simultaneously with a complex in/out N
point FFT.  A small amount of post-processing is required to separate
them.

2) One 2N point real sequence (eg: xr[128], xi[128] = 0) transformed
using one 64 point complex in/out FFT plus pre-processing (ie: odd/
even split) and post-processing (ie: separate even/odd transforms;
recombine to get 2N).

For your purposes (multiple channels of real data), the first
technique is the one to use.

As for processing more than 2 channels (say, 4 or 6 or 10), I suppose
that it might be possible in a single, larger FFT, but I have serious
doubts as to whether that would be an efficient way of doing things.
In fact, it could be grossly inefficient.

When I referred to doing more channels in parallel, this is what I
meant.  To keep things simple, let's say that there's no large memory
space (so we'll have to use a recursion for our twiddle calculations),
and also no multi-threading, etc.  A typical recursive FFT might look
like the 3 loop structure below:

    do ( some addressing )
      do ( more addressing )
        do ( butterfly calculations )
          ......
        }
        twiddle recursion stuff
      }
      more twiddle recursion stuff
    }

To make the software parallel so that it can operate on multiple
channels, why not replicate the butterfly calculation part as many
times as you need?  You could still use the 'two N point real with an
N point complex' method.

In other words, for processing 8 channels of real data, make channels
1 and 2 the real and imaginary parts of a complex input.  So too with
real channels 3 and 4, etc.  Then you could program an N point complex
in/out FFT, and repeat the butterfly code 4 times.  After the FFT,
you'd have 4 complex results - the first one contains channels 1 and 2
(which will be separated by post-processing).  And so on.

The important thing is that you're not doing the addressing and
twiddle recursion overhead for each channel.  You do it only once.
I've done this kind of thing before for image processing, and it
improves things.  But it might be difficult to do if you have to work
with existing code that someone else wrote.

Kevin McGee
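[A C sketch of that shared-bookkeeping structure, with illustrative
names.  For simplicity it uses a twiddle table rather than the
recursion kevin mentions, and a channel loop rather than literally
replicated butterfly code; a real DSP version would unroll that loop:]

    /* Multichannel radix-2 DIF FFT sketch: one set of stage/group/
       twiddle bookkeeping shared by all nch complex channels (2*nch
       packed real channels).  wr/wi hold cos/sin of -2*pi*m/n for
       m = 0..n/2-1.  Outputs are left in bit-reversed order. */
    static void fft_multichannel(float *re[], float *im[], int nch,
                                 int n, const float *wr, const float *wi)
    {
        for (int half = n / 2; half >= 1; half /= 2) {      /* stages */
            int tstep = (n / 2) / half;                     /* twiddle stride */
            for (int b = 0; b < n; b += 2 * half) {         /* groups */
                for (int k = 0; k < half; k++) {            /* butterflies */
                    int i = b + k, j = i + half;
                    float c = wr[k * tstep], s = wi[k * tstep]; /* fetched once */
                    for (int ch = 0; ch < nch; ch++) {      /* replicate body */
                        float tr = re[ch][i] - re[ch][j];
                        float ti = im[ch][i] - im[ch][j];
                        re[ch][i] += re[ch][j];
                        im[ch][i] += im[ch][j];
                        re[ch][j] = tr * c - ti * s;        /* (xi - xj) * W */
                        im[ch][j] = tr * s + ti * c;
                    }
                }
            }
        }
    }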
Reply by kevin January 30, 2011
On Jan 29, 1:53 pm, dbd <d...@ieee.org> wrote:

> As Kevin's tutorial points out, the common tricks are less efficient
> than performing a 2N point real input transform or two N point real
> input transforms.  The tricks are efficient only when the only fft
> implementations you have are complex and your data is real.
Actually, Dale, transforming two separate N point real sequences with
a single complex in/out N point FFT could be faster than doing two
separate N point real FFTs, especially for larger N.

As mentioned in the tutorial, the two real FFTs for N points have to
run faster than the one complex N point FFT plus post-processing to
separate the two transforms.  Although an N point FFT for real inputs
can be faster than an N point FFT for complex inputs, there's always
set-up and addressing overhead involved, too.  Plus, for efficient
processing, one would use look-up tables or twiddle recursions instead
of sin/cos library calls.  So there are twiddle tables or recursions
or bit-reverse tables, etc. that one has to deal with.

I've timed both approaches and have found the single complex FFT plus
separation overhead to be faster than the two real approach,
especially for large N.

I've also found similar results for the '2N real with an N complex'
(eg: 1024 real points with a 512 point complex FFT plus pre and post
processing).  But the cross-over point where doing things with the one
complex FFT becomes faster is larger due to the additional processing
that this approach requires (I've substantially reduced that
processing by using a modified method, but it's not included in my
tutorial).

Of course, using global twiddle tables or multi-threading changes
things, and it could tilt the decision as to what approach is more
efficient substantially.

So yes, one or the other approach could be more or less efficient than
the other, depending on what processing resources are available.

You also mentioned:
> The M channels -can- be assembled and disassembled to allow the use of
> an M*N point transform to achieve M transforms of N points.  The
> assembly and disassembly are not free and the M*N point transform is
> less efficient so it isn't commonly done.
Strongly agree.  The assembly/disassembly is definitely not free.  And
I doubt if the extra M FFT stages are really necessary (versus, say, M
parallel N point implementations).

Kevin McGee
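[For reference, the textbook post-processing for the '2N real with an
N complex' method discussed in this thread - the standard spra291-style
recombination, not kevin's reduced-multiply variant, which isn't shown
here.  A sketch assuming C99 <complex.h>, with Z the N point complex
FFT of the packed sequence z[n] = x[2n] + j*x[2n+1]:]

    #include <complex.h>

    /* Recombine the N point complex FFT of the packed even/odd samples
       into bins 0..N of the 2N point real-input transform.  Fe/Fo are
       the spectra of the even and odd samples; the remaining bins
       follow from conjugate symmetry. */
    static void recombine(const float complex *Z, int N, float complex *X)
    {
        const float pi = 3.14159265358979f;
        for (int k = 0; k <= N; k++) {
            int km = k % N;                            /* bin N wraps to 0 */
            float complex Zc = conjf(Z[(N - km) % N]);
            float complex Fe = 0.5f * (Z[km] + Zc);    /* even samples */
            float complex Fo = -0.5f * I * (Z[km] - Zc); /* odd samples */
            X[k] = Fe + cexpf(-I * pi * (float)k / (float)N) * Fo;
        }
    }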