DSPRelated.com
Forums

FFT tricks for optimized multichannel processing

Started by rsdio January 22, 2011
I'm trying to understand whether it is possible to process multiple,
unrelated signals with a single pass of the FFT.

There is the very common trick mentioned in most DSP texts to speed up the
calculation of the FFT for Real data, by just shoving the audio data into
the Real and Imaginary parts of the FFT inputs and then calculating as if
there were half as many samples. Even though the Imaginary inputs have Real
data in them, the FFT produces the right results, although the order is
scrambled. If you unscramble the results, you get the exact frequency data
without running an FFT that is twice as long. This trick doesn't quite run
twice as fast, but the improvement is significant enough to be worthwhile.
Anyway, this paragraph is just background to give you an idea where I'm
heading. I already know how to apply this technique, and most FFT libraries
provide this speed savings by offering both real-input and complex-input
FFT routines.

Another technique that I read about recently is to send interleaved stereo
audio data as input to a single FFT calculation, and the results can be
obtained in less time than running two FFT calculations (one per channel).
Unfortunately, I did not save a link to this, and cannot find it now with
Google. So, my first question is: Has anyone heard of this technique, used
it, or have a link to a description of how it works or how to implement it?
 When processing stereo (2-channel) signals this way, are the FFT results
combined into one frequency set, or are they still available as discrete
frequency sets?

Now the really big question I have is whether a single FFT calculation can
process 4 or 8 independent channels. The FFT obviously works on 1 channel,
and there seem to be tricks to make it work with 2 channels. I'm just
wondering whether these ideas can be extended to 4 channels or 8 channels.
Each iteration of the FFT works by halving or doubling the number of
samples, with reverse bit addressing to de-scramble the order. It seems
like it might be possible to trick the FFT into 4 or 8 channels, but, then
again, perhaps it all falls apart after 2 channels. I also know that the
low-level implementation of the FFT - the Butterfly operation - can be done
radix-2 or radix-4 in a single pass on my processor, but I don't see any
radix-8 or beyond (and I don't know whether this is related to channel
maximums or potential speed savings). If anyone has deeper knowledge of the
DFT and can illuminate this topic, I'd appreciate it!

As background, I'm dealing with systems that have a single A/D with an
analog multiplexer input.  Thus, when configured for 4 channels or 8
channels, the incoming data is interleaved by the DMA that reads each
sample from the A/D and places it in the next memory location.  My current
implementation simply runs through this interleaved input array and
separates each channel data before calling a 1-channel FFT.  After reading
more about the FFT and various symmetries, I got the idea that it might
just be possible to call the FFT routine on the interleaved data and still
somehow be able to pull out discrete frequency sets for each unrelated
input.  The savings for my application would be that I could remove the
de-interleave pass through the data, which obviously burns a few cycles. 
Again, any insights would be appreciated.


On Jan 22, 9:04=A0pm, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:
> I'm trying to understand whether it is possible to process multiple, > unrelated signals with a single pass of the FFT. > > There is the very common trick mentioned in most DSP texts to speed up th=
e
> calculation of the FFT for Real data, by just shoving the audio data into > the Real and Imaginary parts of the FFT inputs and then calculating as if > there were half as many samples. Even though the Imaginary inputs have Re=
al
> data in them, the FFT produces the right results, although the order is > scrambled. If you unscramble the results, you get the exact frequency dat=
a
> without running an FFT that is twice as long. This trick doesn't quite ru=
n
> twice as fast, but the improvement is significant enough to be worthwhile=
.
> Anyway, this paragraph is just background to give you an idea where I'm > heading. I already know how to apply this technique, and most FFT librari=
es
> provide this speed savings by offering both real-input and complex-input > FFT routines. > > Another technique that I read about recently is to send interleaved stere=
o
> audio data as input to a single FFT calculation, and the results can be > obtained in less time than running two FFT calculations (one per channel)=
.
> Unfortunately, I did not save a link to this, and cannot find it now with > Google. So, my first question is: Has anyone heard of this technique, use=
d
> it, or have a link to a description of how it works or how to implement i=
t?
> =A0When processing stereo (2-channel) signals this way, are the FFT resul=
ts
> combined into one frequency set, or are they still available as discrete > frequency sets? > > Now the really big question I have is whether a single FFT calculation ca=
n
> process 4 or 8 independent channels. The FFT obviously works on 1 channel=
,
> and there seem to be tricks to make it work with 2 channels. I'm just > wondering whether these ideas can be extended to 4 channels or 8 channels=
.
> Each iteration of the FFT works by halving or doubling the number of > samples, with reverse bit addressing to de-scramble the order. It seems > like it might be possible to trick the FFT into 4 or 8 channels, but, the=
n
> again, perhaps it all falls apart after 2 channels. I also know that the > low-level implementation of the FFT - the Butterfly operation - can be do=
ne
> radix-2 or radix-4 in a single pass on my processor, but I don't see any > radix-8 or beyond (and I don't know whether this is related to channel > maximums or potential speed savings). If anyone has deeper knowledge of t=
he
> DFT and can illuminate this topic, I'd appreciate it! > > As background, I'm dealing with systems that have a single A/D with an > analog multiplexer input. =A0Thus, when configured for 4 channels or 8 > channels, the incoming data is interleaved by the DMA that reads each > sample from the A/D and places it in the next memory location. =A0My curr=
ent
> implementation simply runs through this interleaved input array and > separates each channel data before calling a 1-channel FFT. =A0After read=
ing
> more about the FFT and various symmetries, I got the idea that it might > just be possible to call the FFT routine on the interleaved data and stil=
l
> somehow be able to pull out discrete frequency sets for each unrelated > input. =A0The savings for my application would be that I could remove the > de-interleave pass through the data, which obviously burns a few cycles. > Again, any insights would be appreciated.
There are two distinct techniques: 1) Two N point real sequences transformed by one complex in/out N point FFT 2) One 2N point real sequence transformed by one complex in/out N point FFT Both of them are originally from: C. Bingham, M. D. Godfrey, J. W. Tukey, =93Modern Techniques of Power Spectrum Estimation,=94 IEEE Trans. Audio and Electroacoustics, AU-15, no. 2, June 1967, pp. 56-66. I have a tutorial on both methods: http://www.fftguru.com/fftguru.com.tutorial2.pdf I'd have more to say, but it's very late here, so I might add more comments on your question at a later time. Kevin McGee
Thanks, Kevin.  You've already answered one of my questions, which makes my
first posting on comp.dsp successful (for me, at least).

If you have any comments about my other questions, please feel free to
follow up as your time permits.
On Jan 23, 5:44=A0am, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:

> Thanks, Kevin. =A0You've already answered one of my questions, which make=
s my
> first posting on comp.dsp successful (for me, at least). > > If you have any comments about my other questions, please feel free to > follow up as your time permits.
As mentioned in the link I posted, there are two basic techniques: 1) Two N point real with and N point complex FFT 2) 2N real with an N point complex FFT The second technique makes use of the first. For your purposes, the first technique is what you would use to transform two N point real sequences with one N point complex FFT. As for processing say, 4 or 8 data channels, it certainly can (and is) done in hardware, as in, for example, parallel pipelines with a single control structure. But to mimic that processing in software would require that you modify the internals of your FFT algorithm, and that could be difficult and time-consuming. Unless you have a compelling reason to do so, I would suggest that you stick with the 'Two N real with an N complex FFT' approach. As for radix, radix 8 is more efficient than radix 4, which in turn is more efficient than radix 2. My guess is that some people are comfortable with radix 2, fewer with radix 4, and fewer still with radix 8. Kevin McGee
>As for processing say, 4 or 8 data channels, it certainly can (and is) >done in hardware, as in, for example, parallel pipelines with a single >control structure. > >But to mimic that processing in software would require that you modify >the internals of your FFT algorithm, and that could be difficult and >time-consuming.
Maybe I'm crazy, but I am considering this. I am working on the TMS320VC5506, and Texas Instruments provides the source for radix 2 and radix 4. I've been able to write some fairly efficient vector processing routines, so perhaps it's time to work with the FFT internals. Maybe with a few pointers in the right direction, I could tackle my specific 8-channel input. I'm thinking of it as a low-priority potential for optimization to reduce the total cycle count. So, if I'm not successful, it's probably no great loss. I'm sure I would learn something along the way. On that note, I have scaled and unscaled variations. The unscaled starts with radix 4 and continues with radix 2. The scaled always uses radix 2 because it divides by 2 after each. Since I have 12-bit input data from the ADS, and a 16-bit processor, I'm thinking of rolling together a 64-point FFT by combining the unscaled radix 4 and a couple of unscaled radix 2 passes, which should bring my 12-bit data to the 16-bit level, and then continue with scaled radix 2 passes to prevent overflow. Since the unscaled butterfly is faster on this processor, I would gain some of the speed advantages of the purely unscaled implementation, without overflowing. This is all potentially unrelated to my original question about processing all 8 unrelated channels in one transform, but I'm trying to illustrate that I am interested in exploring this as time permits.
>As for radix, radix 8 is more efficient than radix 4, which in turn is >more efficient than radix 2. My guess is that some people are >comfortable with radix 2, fewer with radix 4, and fewer still with >radix 8.
Despite my willingness to juggle the available source I have for radix 2 and radix 4, I don't think I quite have the experience to write my own radix 8 code for this chip. I don't know why they didn't provide open source for radix 8, other than the usual caveat that they're making their money on selling the chips and not on providing an exhaustive set of code for their customers. Then again, it could be that the number of registers in the C55x makes radix 8 more difficult or less efficient. I wouldn't know, since it's outside of my experience so far. In any case, it's interesting to read that radix 8 is more efficient, because I'll now be on the lookout for more educational information about this particular aspect. Thanks again for the additional explanations. Brian Willoughby Sound Consulting
On Sat, 22 Jan 2011 20:04:00 -0600, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:

>I'm trying to understand whether it is possible to process multiple, >unrelated signals with a single pass of the FFT. > >There is the very common trick mentioned in most DSP texts to speed up the >calculation of the FFT for Real data, by just shoving the audio data into >the Real and Imaginary parts of the FFT inputs and then calculating as if >there were half as many samples. Even though the Imaginary inputs have Real >data in them, the FFT produces the right results, although the order is >scrambled. If you unscramble the results, you get the exact frequency data >without running an FFT that is twice as long. This trick doesn't quite run >twice as fast, but the improvement is significant enough to be worthwhile. >Anyway, this paragraph is just background to give you an idea where I'm >heading. I already know how to apply this technique, and most FFT libraries >provide this speed savings by offering both real-input and complex-input >FFT routines. > >Another technique that I read about recently is to send interleaved stereo >audio data as input to a single FFT calculation, and the results can be >obtained in less time than running two FFT calculations (one per channel). >Unfortunately, I did not save a link to this, and cannot find it now with >Google. So, my first question is: Has anyone heard of this technique, used >it, or have a link to a description of how it works or how to implement it? > When processing stereo (2-channel) signals this way, are the FFT results >combined into one frequency set, or are they still available as discrete >frequency sets? > >Now the really big question I have is whether a single FFT calculation can >process 4 or 8 independent channels. The FFT obviously works on 1 channel, >and there seem to be tricks to make it work with 2 channels. I'm just >wondering whether these ideas can be extended to 4 channels or 8 channels. >Each iteration of the FFT works by halving or doubling the number of >samples, with reverse bit addressing to de-scramble the order. It seems >like it might be possible to trick the FFT into 4 or 8 channels, but, then >again, perhaps it all falls apart after 2 channels. I also know that the >low-level implementation of the FFT - the Butterfly operation - can be done >radix-2 or radix-4 in a single pass on my processor, but I don't see any >radix-8 or beyond (and I don't know whether this is related to channel >maximums or potential speed savings). If anyone has deeper knowledge of the >DFT and can illuminate this topic, I'd appreciate it! > >As background, I'm dealing with systems that have a single A/D with an >analog multiplexer input. Thus, when configured for 4 channels or 8 >channels, the incoming data is interleaved by the DMA that reads each >sample from the A/D and places it in the next memory location. My current >implementation simply runs through this interleaved input array and >separates each channel data before calling a 1-channel FFT. After reading >more about the FFT and various symmetries, I got the idea that it might >just be possible to call the FFT routine on the interleaved data and still >somehow be able to pull out discrete frequency sets for each unrelated >input. The savings for my application would be that I could remove the >de-interleave pass through the data, which obviously burns a few cycles. >Again, any insights would be appreciated.
Hi rsdio, Wow! you asked such a good question! I'm not able to add anything useful to Kevin's posts at this time, but I'm going to think some more about your question. If a solution can be found, then it would make for a useful article in my, and Clay Turner's, "DSP Tips & Tricks" column of the IEEE Sig. Proc. Magazine. [-Rick-]

rsdio wrote:

> I'm trying to understand whether it is possible to process multiple, > unrelated signals with a single pass of the FFT.
You can modify an ordinary FFT to work in a SIMD manner, i.e. performing every operation over several data sets at the same time. There will be some efficiency advantage as there is no need to reload the constants and recalculate the indexes, and some disadvantage as there may be not enough of CPU registers for the local optimization. I've done that with the real valued FFT; works fine. Vladimir Vassilevsky DSP and Mixed Signal Design Consultant http://www.abvolt.com
>rsdio wrote: >> I'm trying to understand whether it is possible to process multiple, >> unrelated signals with a single pass of the FFT. > >You can modify an ordinary FFT to work in a SIMD manner, i.e. performing >every operation over several data sets at the same time. There will be >some efficiency advantage as there is no need to reload the constants >and recalculate the indexes, and some disadvantage as there may be not >enough of CPU registers for the local optimization. I've done that with >the real valued FFT; works fine. > >Vladimir Vassilevsky
Thanks. I think this is what Kevin McGee was hinting. However, it seems that this would technically be calculating multiple transforms at the same time. While it's a subtle distinction, I was asking whether a single transform could calculate more than 2 channels. As I mentioned, and Kevin confirmed with his links, it is possible for a single mathematical transform to process 2 real channels (or to get 2 times the bins from 1 real channel). The core of my question is whether a single transform can deal with more than 2 (real) signals. In my specific application, I have a hunch that it would not be possible to calculate multiple transforms at the same time. I'm guessing that since there isn't a readily available radix-8 butterfly, then the processor probably has too few registers to calculate multiple transforms in SIMD fashion. I could be way off, though, which is why I am bringing this up on comp.dsp Brian Willoughby Sound Consulting
(snip, someone wrote)

>>You can modify an ordinary FFT to work in a SIMD manner, i.e. performing >>every operation over several data sets at the same time. There will be >>some efficiency advantage as there is no need to reload the constants >>and recalculate the indexes, and some disadvantage as there may be not >>enough of CPU registers for the local optimization. I've done that with >>the real valued FFT; works fine.
(snip, someone else wrote)
> Thanks. I think this is what Kevin McGee was hinting. However, it seems > that this would technically be calculating multiple transforms at the same > time. While it's a subtle distinction, I was asking whether a single > transform could calculate more than 2 channels.
> As I mentioned, and Kevin confirmed with his links, it is possible for a > single mathematical transform to process 2 real channels (or to get 2 times > the bins from 1 real channel). The core of my question is whether a single > transform can deal with more than 2 (real) signals.
There is a common trick to do a real transform of length 2N as a complex transform of length N. You should also be able to do two real transforms of length N as one complex transform of length N. It seems to me, then, that it shouldn't be hard to do M transforms of length N as a length M*N transform, at least for power of two M's. Then add some work for putting together the longer transform and taking apart the result. It still might be faster than separate transforms. -- glen
>> As I mentioned, and Kevin confirmed with his links, it is possible for
a
>> single mathematical transform to process 2 real channels (or to get 2
times
>> the bins from 1 real channel). The core of my question is whether a
single
>> transform can deal with more than 2 (real) signals. > >There is a common trick to do a real transform of length 2N as >a complex transform of length N. You should also be able to do >two real transforms of length N as one complex transform of length N. > >It seems to me, then, that it shouldn't be hard to do M transforms >of length N as a length M*N transform, at least for power of two M's. > >Then add some work for putting together the longer transform >and taking apart the result. It still might be faster than >separate transforms. > >-- glen
Glen, you've just repeated my original posting that started this discussion. It's the details of your last paragraph that I am searching for. I have a hunch that bit-reversed addressing can possibly handle the splicing, but I have not worked it out in detail yet. I am working with powers of two, because in my specific application I have 8 channels of independent data (at a common sample rate). Now I understand why everyone maintains the entire thread when quoting on comp.dsp - I'm using DSPRelated.com, which archives and collects together all postings in a thread. So, I did a bad job of attributing the replies when quoting - partly because DSPRelated.com does not carry through the attributions, and partly because I assumed most folks would have access to the whole thread anyway. So, Glen, I'm not complaining, but rather apologizing for my posting of a confusing quotation. I should probably dust off my copy of trn Brian Willoughby Sound Consulting