
FFT tricks for optimized multichannel processing

Started by rsdio January 22, 2011
On Jan 28, 11:54 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu>
wrote:
> ...
>
> There is a common trick to do a real transform of length 2N as
> a complex transform of length N. You should also be able to do
> two real transforms of length N as one complex transform of length N.
As Kevin's tutorial points out, the common tricks are less efficient than performing a 2N point real input transform or two N point real input transforms. The tricks are efficient only when the only FFT implementations you have are complex and your data is real.
> It seems to me, then, that it shouldn't be hard to do M transforms
> of length N as a length M*N transform, at least for power of two M's.
The M channels -can- be assembled and disassembled to allow the use of an M*N point transform to achieve M transforms of N points. The assembly and disassembly are not free and the M*N point transform is less efficient so it isn't commonly done.
> Then add some work for putting together the longer transform
> and taking apart the result. It still might be faster than
> separate transforms.
> ...
The computational cost of the M channels of N point transforms is proportional to M*N*log(N). The computational cost of the M*N point transform is proportional to M*N*log(M*N). Even the big transform alone is less efficient.

Many DSP architectures are constructed with hierarchies of memory, in which the smaller, more tightly coupled memories are faster than the larger ones. Any time the increase in transform size forces the algorithm to perform more memory accesses from a slower memory level, the algorithm will slow down even more.

Dale B. Dalrymple
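For a rough feel of the gap Dale describes, a minimal sketch of the operation-count comparison (an editor's illustration, not from the thread; it assumes equal proportionality constants and ignores both the assemble/disassemble work and the memory-hierarchy effects):

    #include <math.h>
    #include <stdio.h>

    /* Compare ~M*N*log2(N) (M separate N-point FFTs) against
       ~M*N*log2(M*N) (one combined M*N-point FFT). */
    int main(void)
    {
        const double M = 8.0, N = 1024.0;          /* 8 channels, 1024 points each */
        double separate = M * N * log2(N);         /* 8*1024*10 = 81920  */
        double combined = M * N * log2(M * N);     /* 8*1024*13 = 106496 */
        printf("separate %.0f  combined %.0f  ratio %.2f\n",
               separate, combined, combined / separate);  /* ratio = 1.30 */
        return 0;
    }

For the OP's case (M = 8 channels), the combined transform does roughly 30% more butterfly-level work before any packing overhead is counted.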
rsdio <Sound_Consulting@n_o_s_p_a_m.sounds.wa.com> wrote:
(snip, I wrote)

>>It seems to me, then, that it shouldn't be hard to do M transforms
>>of length N as a length M*N transform, at least for power of two M's.
(snip)
> Glen, you've just repeated my original posting that started this
> discussion. It's the details of your last paragraph that I am searching
> for. I have a hunch that bit-reversed addressing can possibly handle the
> splicing, but I have not worked it out in detail yet. I am working with
> powers of two, because in my specific application I have 8 channels of
> independent data (at a common sample rate).
OK, the process for converting 2N real points to N complex points, transforming that, and then converting to the appropriate 2N point result is known. It should then be possible to convert N complex points to 2N real points.

I believe it also isn't hard to do two N point real transforms as one complex transform. If so, then you can convert two N point transforms to a 2N point transform, repeat as needed.
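For reference, a minimal sketch of the 2N-real-via-N-complex process glen describes (an editor's illustration, not glen's code; cfft is a hypothetical in-place N-point complex FFT, and only bins 0..N-1 are produced since the remaining bins follow from conjugate symmetry):

    #include <complex.h>
    #include <math.h>

    extern void cfft(double complex *z, int n);   /* hypothetical in-place complex FFT */

    /* x: 2N real samples in; X: bins 0..N-1 of the 2N-point transform out */
    void rfft_2N_via_cfft_N(const double *x, double complex *X, int N)
    {
        double complex z[N];                      /* C99 VLA, fine for a sketch */
        for (int n = 0; n < N; n++)               /* pack even/odd samples      */
            z[n] = x[2*n] + I * x[2*n + 1];

        cfft(z, N);                               /* one N-point complex FFT    */

        for (int k = 0; k < N; k++) {             /* post-process: split and recombine */
            double complex a  = z[k];
            double complex b  = conj(z[(N - k) % N]);
            double complex Fe = 0.5 * (a + b);           /* DFT of even samples */
            double complex Fo = -0.5 * I * (a - b);      /* DFT of odd samples  */
            X[k] = Fe + cexp(-I * M_PI * k / N) * Fo;    /* twiddle and combine */
        }
    }

The inverse direction glen asks about runs the same steps backward: split X into the even/odd spectra, inverse transform, and interleave.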
> Now I understand why everyone maintains the entire thread when quoting on
> comp.dsp - I'm using DSPRelated.com, which archives and collects together
> all postings in a thread. So, I did a bad job of attributing the replies
> when quoting - partly because DSPRelated.com does not carry through the
> attributions, and partly because I assumed most folks would have access to
> the whole thread anyway.
> So, Glen, I'm not complaining, but rather apologizing for my posting of a
> confusing quotation. I should probably dust off my copy of trn
-- glen
On Jan 29, 6:57 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> ...
>
> OK, the process for converting 2N real points to N complex points,
> transforming that, and then converting to the appropriate 2N
> point result is known. It should then be possible to convert
> N complex points to 2N real points.
>
> I believe it also isn't hard to do two N point real transforms
> as one complex transform. If so, then you can convert two N
> point transforms to a 2N point transform, repeat as needed.
> ...
>
> -- glen
Since each of these processes takes time domain samples as inputs and produces frequency domain coefficients as outputs, how is it possible to "repeat as necessary"?

Dale B. Dalrymple
On Jan 29, 1:53 pm, dbd <d...@ieee.org> wrote:

>As Kevin's tutorial points out, the common tricks are less efficient
>than performing a 2N point real input transform or two N point real
>input transforms. The tricks are efficient only when the only FFT
>implementations you have are complex and your data is real.
Actually, Dale, transforming two separate N point real sequences with a single complex in/out N point FFT could be faster than doing two separate N point real FFTs, especially for larger N. As mentioned in the tutorial, the two real FFTs for N points have to run faster than the one complex N point FFT plus the post-processing to separate the two transforms.

Although an N point FFT for real inputs can be faster than an N point FFT for complex inputs, there's always set-up and addressing overhead involved, too. Plus, for efficient processing, one would use look-up tables or twiddle recursions instead of sin/cos library calls. So there are twiddle tables or recursions or bit-reverse tables, etc. that one has to deal with.

I've timed both approaches and have found the single complex FFT plus separation overhead to be faster than the two-real approach, especially for large N. I've also found similar results for the '2N real with an N complex' method (eg: 1024 real points with a 512 point complex FFT plus pre- and post-processing). But the cross-over point where doing things with the one complex FFT becomes faster is larger due to the additional processing that this approach requires (I've substantially reduced that processing by using a modified method, but it's not included in my tutorial).

Of course, using global twiddle tables or multi-threading changes things, and it could tilt the decision as to what approach is more efficient substantially.

So yes, one or the other approach could be more or less efficient than the other, depending on what processing resources are available.

You also mentioned:
>The M channels -can- be assembled and disassembled to allow the use of
>an M*N point transform to achieve M transforms of N points. The
>assembly and disassembly are not free and the M*N point transform is
>less efficient so it isn't commonly done.
Strongly agree. The assembly/disassembly is definitely not free. And I doubt if the extra M FFT stages are really necessary (versus, say, M parallel N point implementations).

Kevin McGee
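For concreteness, a minimal sketch of the pack-and-separate method Kevin is timing above, i.e. two N-point real transforms from one N-point complex FFT (an editor's illustration, not Kevin's code; cfft is a hypothetical in-place complex FFT):

    #include <complex.h>

    extern void cfft(double complex *z, int n);   /* hypothetical in-place complex FFT */

    /* Transform two length-N real sequences x and y with one complex FFT,
       then separate the two spectra X and Y in a post-processing pass. */
    void two_real_ffts_via_one_cfft(const double *x, const double *y,
                                    double complex *X, double complex *Y,
                                    double complex *z, int N)
    {
        for (int n = 0; n < N; n++)
            z[n] = x[n] + I * y[n];               /* pack: x -> real, y -> imag */

        cfft(z, N);                               /* one N-point complex FFT    */

        for (int k = 0; k < N; k++) {             /* separation post-processing */
            double complex a = z[k];
            double complex b = conj(z[(N - k) % N]);
            X[k] = 0.5 * (a + b);                 /* spectrum of x */
            Y[k] = -0.5 * I * (a - b);            /* spectrum of y */
        }
    }

The separation pass is only O(N), so its relative cost shrinks as N grows, which is consistent with the observation that the single complex FFT wins especially for large N.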
On Jan 28, 7:56 pm, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:

>In my specific application, I have a hunch that it would not be possible to
>calculate multiple transforms at the same time. I'm guessing that since
>there isn't a readily available radix-8 butterfly, then the processor
>probably has too few registers to calculate multiple transforms in SIMD
>fashion. I could be way off, though, which is why I am bringing this up on
>comp.dsp
Radix has little to do with it - higher radix has to do with the lower number of complex multiplies in the FFT. As for your other concerns, let me clarify that there are really two distinct techniques:

1) Two different N point real sequences: (xr[N], xi[N] = 0) and (yr[N], yi[N] = 0) transformed simultaneously with a complex in/out N point FFT. A small amount of post-processing is required to separate them.

2) One 2N point real sequence (eg: xr[128], xi[128] = 0) transformed using one 64 point complex in/out FFT plus pre-processing (ie: odd/even split) and post-processing (ie: separate even/odd transforms; recombine to get 2N).

For your purposes (multiple channels of real data), the first technique is the one to use.

As for processing more than 2 channels (say, 4 or 6 or 10), I suppose that it might be possible in a single, larger FFT, but I have serious doubts as to whether that would be an efficient way of doing things. In fact, it could be grossly inefficient.

When I referred to doing more channels in parallel, this is what I meant. To keep things simple, let's say that there's no large memory space (so we'll have to use a recursion for our twiddle calculations), and also no multi-threading, etc. A typical recursive FFT might look like the 3 loop structure below:

    do ( some addressing )
      do ( more addressing )
        do ( butterfly calculations )
          ......
        } twiddle recursion stuff
      } more twiddle recursion stuff
    }

To make the software parallel so that it can operate on multiple channels, why not replicate the butterfly calculation part as many times as you need? You could still use the 'two N point real with an N point complex' method.

In other words, for processing 8 channels of real data, make channels 1 and 2 the real and imaginary parts of a complex input. So too with real channels 3 and 4, etc. Then you could program an N point complex in/out FFT, and repeat the butterfly code 4 times. After the FFT, you'd have 4 complex results - the first one contains channels 1 and 2 (which will be separated by post-processing). And so on.

The important thing is that you're not doing the addressing and twiddle recursion overhead for each channel. You do it only once. I've done this kind of thing before for image processing, and it improves things. But it might be difficult to do if you have to work with existing code that someone else wrote.

Kevin McGee
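A minimal sketch of that butterfly-replication idea (an editor's illustration, not Kevin's code; it uses a radix-2 loop nest rather than his radix 8, assumes each buffer has already been permuted into bit-reversed order, and omits the per-pair separation step):

    #include <complex.h>
    #include <math.h>

    #define CH 4    /* 4 complex buffers = 8 real channels packed in pairs */

    /* One set of addressing and twiddle-recursion work drives the butterfly
       for all CH channel buffers; only the innermost body is replicated. */
    void fft_multichannel(double complex *z[CH], int N)
    {
        for (int len = 2; len <= N; len <<= 1) {            /* stage loop     */
            double complex wstep = cexp(-2.0 * I * M_PI / len);
            for (int start = 0; start < N; start += len) {  /* group loop     */
                double complex w = 1.0;
                for (int k = 0; k < len / 2; k++) {         /* butterfly loop */
                    int i = start + k, j = i + len / 2;
                    for (int c = 0; c < CH; c++) {      /* replicated part:   */
                        double complex t = w * z[c][j]; /* same indices and   */
                        z[c][j] = z[c][i] - t;          /* twiddle reused for */
                        z[c][i] += t;                   /* every channel      */
                    }
                    w *= wstep;                             /* twiddle recursion */
                }
            }
        }
    }

In practice the innermost channel loop would be unrolled by hand - Kevin's 'repeat the butterfly code 4 times' - so that the addressing and the twiddle recursion are still done only once per butterfly position.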
Thanks, Kevin.  You've certainly given me some ideas and specific
algorithms to try out.  It will take me a while to digest, and then I have
to fit the optimization phase of my development into my schedule.  Until
then, you have my thanks in advance.
On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
> ...
>
> Of course, using global twiddle tables or multi-threading changes
> things, and it could tilt the decision as to what approach is more
> efficient substantially.
>
> So yes, one or the other approach could be more or less efficient than
> the other, depending on what processing resources are available.
> ...
> Kevin McGee
The point of using a dedicated DSP chip with an RTOS instead of a Windows OS on an Intel CPU is to be able to put all the table setup into the initialization and avoid any recurring setup costs during real-time operation. The OP seems to have a DSP chip. What are the timings like for an efficient real-time implementation without the setup overhead?

Dale B. Dalrymple
On Jan 30, 12:32 pm, dbd <d...@ieee.org> wrote:
> On Jan 30, 12:16 am, kevin <kevinjmc...@netscape.net> wrote:
>
> > ...
> >
> > Of course, using global twiddle tables or multi-threading changes
> > things, and it could tilt the decision as to what approach is more
> > efficient substantially.
> >
> > So yes, one or the other approach could be more or less efficient than
> > the other, depending on what processing resources are available.
> > ...
> > Kevin McGee
>
> The point of using a dedicated DSP chip with an RTOS instead of a
> Windows OS on an Intel CPU is to be able to put all the table setup
> into the initialization and avoid any recurring setup costs during
> real-time operation. The OP seems to have a DSP chip. What are the
> timings like for an efficient real-time implementation without the
> setup overhead?
>
> Dale B. Dalrymple
Well, I used to have a TMS320 evaluation board, but that was ten years ago.

As far as efficiency, even if you strip out all the overhead, you still have to do a certain number of calculations for a particular algorithm. And you know how many there are beforehand. So you have a fairly good idea of what the minimum is for doing the calculations. Anything beyond that is overhead, and some of it is necessary.

As for me, I usually stick with radix 8 to minimize multiplies, and use look-up tables for twiddle factors and bit reverse. I can do other things to be more efficient, but it usually isn't worth it to improve something by 5 or 10% if it takes twice (or more) as long to write and test it.

As far as the 'Two N real with an N complex' method - I think that, even with an RTOS, it would be more efficient to use it instead of doing two separate N point real transforms. The problem with the latter is that any address overhead associated with the chosen FFT algorithm will have to be repeated for each real FFT. However, with an RTOS (less overhead), the cross-over point where the 'Two N real with an N complex' becomes more efficient might be at a larger value of N.

Kevin McGee
On Jan 31, 2:50 am, kevin <kevinjmc...@netscape.net> wrote:
> As far as efficiency, even if you strip out all the overhead, you
> still have to do a certain number of calculations for a particular
> algorithm. And you know how many there are beforehand. So you have a
> fairly good idea of what the minimum is for doing the calculations.
> Anything beyond that is overhead, and some of it is necessary.
i agree with that.
> As for me, I usually stick with radix 8 to minimize multiplies, and
> use look-up tables for twiddle factors and bit reverse.
Kevin, i am curious as to how multiplies are reduced, but i *do* understand how bumping it up from radix-2 to radix-4 will save you from reloading the real and imag parts of the twiddle unnecessarily. this is because

   e^(j*(theta + pi/2)) = j*e^(j*theta)

so the real and imaginary parts are simply swapped (in their usage, you need not move the actual numbers in registers) and some additions are turned into subtractions. also, if this is in fixed point and you want to scale it down by sqrt(2) for each FFT pass, since this is 2 passes, it's scaled down by 2 and it's a simple ASR operation on the results.

i do not see how bumping it from radix-4 to radix-8 will help further, other than if there is no limit to the number of machine registers, the cost of moving the FFT buffer data into and outa registers might be slightly reduced.

this is not exactly radix-8, but i *can* imagine in a single pass, having 4 butterflies with twiddles of

   e^(j*(theta + 0))
   e^(j*(theta + pi/4))
   e^(j*(pi/2 - theta))
   e^(j*(pi/4 - theta))

for 0 < theta < pi/8, which would process 8 points. but it's not simply a radix-8 thing. it's two radix-4 passes moving in opposite directions. there would be two special cases where theta = 0 and theta = pi/8 where code yanked outside of the loop would run separately (and efficiently). otherwise, i dunno what good radix-8 does over radix-4.
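A tiny numeric check of that quarter-turn identity (an editor's sketch; a forward FFT with the e^(-j...) convention picks up a factor of -j instead, but the real/imag swap is the same):

    #include <assert.h>
    #include <complex.h>
    #include <math.h>

    /* e^(j*(theta + pi/2)) = j * e^(j*theta): advancing a twiddle angle by a
       quarter turn only swaps its real and imaginary parts (with a sign flip),
       so the stored cos/sin pair can be reused instead of reloaded. */
    int main(void)
    {
        double theta = 0.3;                           /* any angle      */
        double complex w  = cexp(I * theta);
        double complex wq = cexp(I * (theta + M_PI / 2.0));
        assert(cabs(wq - I * w) < 1e-12);             /* wq = j*w       */
        assert(fabs(creal(wq) + cimag(w)) < 1e-12);   /* re(wq) = -im(w) */
        assert(fabs(cimag(wq) - creal(w)) < 1e-12);   /* im(wq) =  re(w) */
        return 0;
    }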
> I can do
> other things to be more efficient, but it usually isn't worth it to
> improve something by 5 or 10% if it takes twice (or more) as long to
> write and test it.
avoiding unnecessary multiplication by 1 or j seems to be worth it. it means (in the radix-4 case, anyway) that the very first butterfly of the group is yanked out of the DO loop and handled differently. that shouldn't be too hard to write and test.
> As far as the 'Two N real with an N complex' method - I think that,
> even with an RTOS, it would be more efficient to use it instead of
> doing two separate N point real transforms.
i think that's pretty clear. if you are doing fast convolution, you can filter two separate *real* streams (or two overlapping blocks of the same stream) by stuffing one into the real part and the other into the imaginary part. do your overlap-add or overlap-save thing and what comes out in the real and imag parts are your two output streams (or two output blocks).

r b-j
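A minimal sketch of that trick for a single block (an editor's illustration, not r b-j's code; cfft/icfft are hypothetical forward/inverse complex FFTs with the 1/N scaling in the inverse, H is the transform of a real impulse response, and the overlap-add/overlap-save bookkeeping is left out):

    #include <complex.h>

    extern void cfft(double complex *z, int n);    /* hypothetical forward FFT */
    extern void icfft(double complex *z, int n);   /* hypothetical inverse FFT */

    /* Filter two real blocks x1 and x2 with the same real filter (spectrum H)
       using a single complex FFT/IFFT pair.  Because the filter is real, the
       real and imaginary parts of the result are the two filtered blocks. */
    void fast_convolve_two_real(const double *x1, const double *x2,
                                const double complex *H,
                                double *y1, double *y2,
                                double complex *z, int N)
    {
        for (int n = 0; n < N; n++)
            z[n] = x1[n] + I * x2[n];              /* pack the two real blocks */

        cfft(z, N);
        for (int k = 0; k < N; k++)
            z[k] *= H[k];                          /* pointwise multiply by filter */
        icfft(z, N);

        for (int n = 0; n < N; n++) {
            y1[n] = creal(z[n]);                   /* output block for stream 1 */
            y2[n] = cimag(z[n]);                   /* output block for stream 2 */
        }
    }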
On Jan 30, 11:50 pm, kevin <kevinjmc...@netscape.net> wrote:
> ..
>
> Well, I used to have a TMS320 evaluation board, but that was ten years
> ago.
So are you relating workstation benchmarks or benchmarks from a DSP architecture?
> ...
> As for me, I usually stick with radix 8 to minimize multiplies, and
> use look-up tables for twiddle factors and bit reverse. I can do
> other things to be more efficient, but it usually isn't worth it to
> improve something by 5 or 10% if it takes twice (or more) as long to
> write and test it.
That's a common choice for people implementing benchmarks at their desk on a workstation. People implementing high volume embedded systems look at that tradeoff differently.

Looking at the DSP55x architecture, the right answer (well, efficient implementation) for the OP probably relates more to DSP architectural features than workstation benchmarks. The DSP55x chips have a hardware FFT accelerator (HWAFFT) that runs CX FFTs at 2 to 6 times the speed of the directly programmed CPU on the chip, so any scheme that leads to the use of larger transforms than supported by the HWAFFT is likely to be less efficient.

On the other hand, if the OP actually has a homework problem to pack multiple channels of FFT into a single FFT call, we haven't helped the OP (or Rick) much.

Dale B. Dalrymple