comp.dsp | FFT tricks for optimized multichannel processing

I'm trying to understand whether it is possible to process multiple,
unrelated signals with a single pass of the FFT.

There is the very common trick mentioned in most DSP texts to speed up the
calculation of the FFT for Real data, by just shoving the audio data into
the Real and Imaginary parts of the FFT inputs and then calculating as if
there were half as many samples. Even though the Imaginary inputs have Real
data in them, the FFT produces the right results, although the order is
scrambled. If you unscramble the results, you get the exact frequency data
without running an FFT that is twice as long. This trick doesn't quite run
twice as fast, but the improvement is significant enough to be worthwhile.
Anyway, this paragraph is just background to give you an idea where I'm
heading. I already know how to apply this technique, and most FFT libraries
provide this speed savings by offering both real-input and complex-input
FFT routines.

Another technique that I read about recently is to send interleaved stereo
audio data as input to a single FFT calculation, and the results can be
obtained in less time than running two FFT calculations (one per channel).
Unfortunately, I did not save a link to this, and cannot find it now with
Google. So, my first question is: Has anyone heard of this technique, used
it, or have a link to a description of how it works or how to implement it?
 When processing stereo (2-channel) signals this way, are the FFT results
combined into one frequency set, or are they still available as discrete
frequency sets?

Now the really big question I have is whether a single FFT calculation can
process 4 or 8 independent channels. The FFT obviously works on 1 channel,
and there seem to be tricks to make it work with 2 channels. I'm just
wondering whether these ideas can be extended to 4 channels or 8 channels.
Each iteration of the FFT works by halving or doubling the number of
samples, with reverse bit addressing to de-scramble the order. It seems
like it might be possible to trick the FFT into 4 or 8 channels, but, then
again, perhaps it all falls apart after 2 channels. I also know that the
low-level implementation of the FFT - the Butterfly operation - can be done
radix-2 or radix-4 in a single pass on my processor, but I don't see any
radix-8 or beyond (and I don't know whether this is related to channel
maximums or potential speed savings). If anyone has deeper knowledge of the
DFT and can illuminate this topic, I'd appreciate it!

As background, I'm dealing with systems that have a single A/D with an
analog multiplexer input.  Thus, when configured for 4 channels or 8
channels, the incoming data is interleaved by the DMA that reads each
sample from the A/D and places it in the next memory location.  My current
implementation simply runs through this interleaved input array and
separates each channel data before calling a 1-channel FFT.  After reading
more about the FFT and various symmetries, I got the idea that it might
just be possible to call the FFT routine on the interleaved data and still
somehow be able to pull out discrete frequency sets for each unrelated
input.  The savings for my application would be that I could remove the
de-interleave pass through the data, which obviously burns a few cycles. 
Again, any insights would be appreciated.

Reply by kevin ●January 23, 20112011-01-23

On Jan 22, 9:04=A0pm, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:
> I'm trying to understand whether it is possible to process multiple,
> unrelated signals with a single pass of the FFT.
>
> There is the very common trick mentioned in most DSP texts to speed up th=
e
> calculation of the FFT for Real data, by just shoving the audio data into
> the Real and Imaginary parts of the FFT inputs and then calculating as if
> there were half as many samples. Even though the Imaginary inputs have Re=
al
> data in them, the FFT produces the right results, although the order is
> scrambled. If you unscramble the results, you get the exact frequency dat=
a
> without running an FFT that is twice as long. This trick doesn't quite ru=
n
> twice as fast, but the improvement is significant enough to be worthwhile=
.
> Anyway, this paragraph is just background to give you an idea where I'm
> heading. I already know how to apply this technique, and most FFT librari=
es
> provide this speed savings by offering both real-input and complex-input
> FFT routines.
>
> Another technique that I read about recently is to send interleaved stere=
o
> audio data as input to a single FFT calculation, and the results can be
> obtained in less time than running two FFT calculations (one per channel)=
.
> Unfortunately, I did not save a link to this, and cannot find it now with
> Google. So, my first question is: Has anyone heard of this technique, use=
d
> it, or have a link to a description of how it works or how to implement i=
t?
> =A0When processing stereo (2-channel) signals this way, are the FFT resul=
ts
> combined into one frequency set, or are they still available as discrete
> frequency sets?
>
> Now the really big question I have is whether a single FFT calculation ca=
n
> process 4 or 8 independent channels. The FFT obviously works on 1 channel=
,
> and there seem to be tricks to make it work with 2 channels. I'm just
> wondering whether these ideas can be extended to 4 channels or 8 channels=
.
> Each iteration of the FFT works by halving or doubling the number of
> samples, with reverse bit addressing to de-scramble the order. It seems
> like it might be possible to trick the FFT into 4 or 8 channels, but, the=
n
> again, perhaps it all falls apart after 2 channels. I also know that the
> low-level implementation of the FFT - the Butterfly operation - can be do=
ne
> radix-2 or radix-4 in a single pass on my processor, but I don't see any
> radix-8 or beyond (and I don't know whether this is related to channel
> maximums or potential speed savings). If anyone has deeper knowledge of t=
he
> DFT and can illuminate this topic, I'd appreciate it!
>
> As background, I'm dealing with systems that have a single A/D with an
> analog multiplexer input. =A0Thus, when configured for 4 channels or 8
> channels, the incoming data is interleaved by the DMA that reads each
> sample from the A/D and places it in the next memory location. =A0My curr=
ent
> implementation simply runs through this interleaved input array and
> separates each channel data before calling a 1-channel FFT. =A0After read=
ing
> more about the FFT and various symmetries, I got the idea that it might
> just be possible to call the FFT routine on the interleaved data and stil=
l
> somehow be able to pull out discrete frequency sets for each unrelated
> input. =A0The savings for my application would be that I could remove the
> de-interleave pass through the data, which obviously burns a few cycles.
> Again, any insights would be appreciated.

There are two distinct techniques:

1) Two N point real sequences transformed by one complex in/out N
point FFT
2) One 2N point real sequence transformed by one complex in/out N
point FFT

Both of them are originally from:

C. Bingham, M. D. Godfrey, J. W. Tukey, =93Modern Techniques of Power
Spectrum
Estimation,=94 IEEE Trans. Audio and Electroacoustics, AU-15, no. 2,
June 1967, pp. 56-66.

I have a tutorial on both methods:

http://www.fftguru.com/fftguru.com.tutorial2.pdf

I'd have more to say, but it's very late here, so I might add more
comments on your question at a later time.

Kevin McGee

Reply by rsdio ●January 23, 20112011-01-23

Thanks, Kevin.  You've already answered one of my questions, which makes my
first posting on comp.dsp successful (for me, at least).

If you have any comments about my other questions, please feel free to
follow up as your time permits.

Reply by kevin ●January 24, 20112011-01-24

On Jan 23, 5:44=A0am, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:

> Thanks, Kevin. =A0You've already answered one of my questions, which make=
s my
> first posting on comp.dsp successful (for me, at least).
>
> If you have any comments about my other questions, please feel free to
> follow up as your time permits.

As mentioned in the link I posted, there are two basic techniques:

1) Two N point real with and N point complex FFT
2) 2N real with an N point complex FFT

The second technique makes use of the first.

For your purposes, the first technique is what you would use to
transform two N point real sequences with one N point complex FFT.

As for processing say, 4 or 8 data channels, it certainly can (and is)
done in hardware, as in, for example, parallel pipelines with a single
control structure.

But to mimic that processing in software would require that you modify
the internals of your FFT algorithm, and that could be difficult and
time-consuming.  Unless you have a compelling reason to do so, I would
suggest that you stick with the 'Two N real with an N complex FFT'
approach.

As for radix, radix 8 is more efficient than radix 4, which in turn is
more efficient than radix 2.  My guess is that some people are
comfortable with radix 2, fewer with radix 4, and fewer still with
radix 8.

Kevin McGee

Reply by rsdio ●January 26, 20112011-01-26

>As for processing say, 4 or 8 data channels, it certainly can (and is)
>done in hardware, as in, for example, parallel pipelines with a single
>control structure.
>
>But to mimic that processing in software would require that you modify
>the internals of your FFT algorithm, and that could be difficult and
>time-consuming.

Maybe I'm crazy, but I am considering this.  I am working on the
TMS320VC5506, and Texas Instruments provides the source for radix 2 and
radix 4.  I've been able to write some fairly efficient vector processing
routines, so perhaps it's time to work with the FFT internals.

Maybe with a few pointers in the right direction, I could tackle my
specific 8-channel input.  I'm thinking of it as a low-priority potential
for optimization to reduce the total cycle count.  So, if I'm not
successful, it's probably no great loss.  I'm sure I would learn something
along the way.


On that note, I have scaled and unscaled variations.  The unscaled starts
with radix 4 and continues with radix 2.  The scaled always uses radix 2
because it divides by 2 after each.  Since I have 12-bit input data from
the ADS, and a 16-bit processor, I'm thinking of rolling together a
64-point FFT by combining the unscaled radix 4 and a couple of unscaled
radix 2 passes, which should bring my 12-bit data to the 16-bit level, and
then continue with scaled radix 2 passes to prevent overflow.  Since the
unscaled butterfly is faster on this processor, I would gain some of the
speed advantages of the purely unscaled implementation, without
overflowing.  This is all potentially unrelated to my original question
about processing all 8 unrelated channels in one transform, but I'm trying
to illustrate that I am interested in exploring this as time permits.


>As for radix, radix 8 is more efficient than radix 4, which in turn is
>more efficient than radix 2.  My guess is that some people are
>comfortable with radix 2, fewer with radix 4, and fewer still with
>radix 8.

Despite my willingness to juggle the available source I have for radix 2
and radix 4, I don't think I quite have the experience to write my own
radix 8 code for this chip.  I don't know why they didn't provide open
source for radix 8, other than the usual caveat that they're making their
money on selling the chips and not on providing an exhaustive set of code
for their customers.  Then again, it could be that the number of registers
in the C55x makes radix 8 more difficult or less efficient.  I wouldn't
know, since it's outside of my experience so far.

In any case, it's interesting to read that radix 8 is more efficient,
because I'll now be on the lookout for more educational information about
this particular aspect.

Thanks again for the additional explanations.

Brian Willoughby
Sound Consulting

Reply by Rick Lyons ●January 28, 20112011-01-28

On Sat, 22 Jan 2011 20:04:00 -0600, "rsdio"
<Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:

>I'm trying to understand whether it is possible to process multiple,
>unrelated signals with a single pass of the FFT.
>
>There is the very common trick mentioned in most DSP texts to speed up the
>calculation of the FFT for Real data, by just shoving the audio data into
>the Real and Imaginary parts of the FFT inputs and then calculating as if
>there were half as many samples. Even though the Imaginary inputs have Real
>data in them, the FFT produces the right results, although the order is
>scrambled. If you unscramble the results, you get the exact frequency data
>without running an FFT that is twice as long. This trick doesn't quite run
>twice as fast, but the improvement is significant enough to be worthwhile.
>Anyway, this paragraph is just background to give you an idea where I'm
>heading. I already know how to apply this technique, and most FFT libraries
>provide this speed savings by offering both real-input and complex-input
>FFT routines.
>
>Another technique that I read about recently is to send interleaved stereo
>audio data as input to a single FFT calculation, and the results can be
>obtained in less time than running two FFT calculations (one per channel).
>Unfortunately, I did not save a link to this, and cannot find it now with
>Google. So, my first question is: Has anyone heard of this technique, used
>it, or have a link to a description of how it works or how to implement it?
> When processing stereo (2-channel) signals this way, are the FFT results
>combined into one frequency set, or are they still available as discrete
>frequency sets?
>
>Now the really big question I have is whether a single FFT calculation can
>process 4 or 8 independent channels. The FFT obviously works on 1 channel,
>and there seem to be tricks to make it work with 2 channels. I'm just
>wondering whether these ideas can be extended to 4 channels or 8 channels.
>Each iteration of the FFT works by halving or doubling the number of
>samples, with reverse bit addressing to de-scramble the order. It seems
>like it might be possible to trick the FFT into 4 or 8 channels, but, then
>again, perhaps it all falls apart after 2 channels. I also know that the
>low-level implementation of the FFT - the Butterfly operation - can be done
>radix-2 or radix-4 in a single pass on my processor, but I don't see any
>radix-8 or beyond (and I don't know whether this is related to channel
>maximums or potential speed savings). If anyone has deeper knowledge of the
>DFT and can illuminate this topic, I'd appreciate it!
>
>As background, I'm dealing with systems that have a single A/D with an
>analog multiplexer input.  Thus, when configured for 4 channels or 8
>channels, the incoming data is interleaved by the DMA that reads each
>sample from the A/D and places it in the next memory location.  My current
>implementation simply runs through this interleaved input array and
>separates each channel data before calling a 1-channel FFT.  After reading
>more about the FFT and various symmetries, I got the idea that it might
>just be possible to call the FFT routine on the interleaved data and still
>somehow be able to pull out discrete frequency sets for each unrelated
>input.  The savings for my application would be that I could remove the
>de-interleave pass through the data, which obviously burns a few cycles. 
>Again, any insights would be appreciated.

Hi rsdio,
  Wow! you asked such a good question!
I'm not able to add anything useful to 
Kevin's posts at this time, but I'm going to 
think some more about your question.

If a solution can be found, then it would 
make for a useful article in my, and Clay 
Turner's, "DSP Tips & Tricks" column of the 
IEEE Sig. Proc. Magazine.

[-Rick-]

Reply by Vladimir Vassilevsky ●January 28, 20112011-01-28

rsdio wrote:

> I'm trying to understand whether it is possible to process multiple,
> unrelated signals with a single pass of the FFT.

You can modify an ordinary FFT to work in a SIMD manner, i.e. performing 
every operation over several data sets at the same time. There will be 
some efficiency advantage as there is no need to reload the constants 
and recalculate the indexes, and some disadvantage as there may be not 
enough of CPU registers for the local optimization. I've done that with 
the real valued FFT; works fine.

Vladimir Vassilevsky
DSP and Mixed Signal Design Consultant
http://www.abvolt.com

Reply by rsdio ●January 28, 20112011-01-28

>rsdio wrote:
>> I'm trying to understand whether it is possible to process multiple,
>> unrelated signals with a single pass of the FFT.
>
>You can modify an ordinary FFT to work in a SIMD manner, i.e. performing 
>every operation over several data sets at the same time. There will be 
>some efficiency advantage as there is no need to reload the constants 
>and recalculate the indexes, and some disadvantage as there may be not 
>enough of CPU registers for the local optimization. I've done that with 
>the real valued FFT; works fine.
>
>Vladimir Vassilevsky

Thanks.  I think this is what Kevin McGee was hinting.  However, it seems
that this would technically be calculating multiple transforms at the same
time.  While it's a subtle distinction, I was asking whether a single
transform could calculate more than 2 channels.

As I mentioned, and Kevin confirmed with his links, it is possible for a
single mathematical transform to process 2 real channels (or to get 2 times
the bins from 1 real channel).  The core of my question is whether a single
transform can deal with more than 2 (real) signals.

In my specific application, I have a hunch that it would not be possible to
calculate multiple transforms at the same time.  I'm guessing that since
there isn't a readily available radix-8 butterfly, then the processor
probably has too few registers to calculate multiple transforms in SIMD
fashion.  I could be way off, though, which is why I am bringing this up on
comp.dsp

Brian Willoughby
Sound Consulting

Reply by glen herrmannsfeldt ●January 29, 20112011-01-29

(snip, someone wrote)

>>You can modify an ordinary FFT to work in a SIMD manner, i.e. performing 
>>every operation over several data sets at the same time. There will be 
>>some efficiency advantage as there is no need to reload the constants 
>>and recalculate the indexes, and some disadvantage as there may be not 
>>enough of CPU registers for the local optimization. I've done that with 
>>the real valued FFT; works fine.
(snip, someone else wrote)

> Thanks.  I think this is what Kevin McGee was hinting.  However, it seems
> that this would technically be calculating multiple transforms at the same
> time.  While it's a subtle distinction, I was asking whether a single
> transform could calculate more than 2 channels.
 
> As I mentioned, and Kevin confirmed with his links, it is possible for a
> single mathematical transform to process 2 real channels (or to get 2 times
> the bins from 1 real channel).  The core of my question is whether a single
> transform can deal with more than 2 (real) signals.

There is a common trick to do a real transform of length 2N as
a complex transform of length N.  You should also be able to do
two real transforms of length N as one complex transform of length N.

It seems to me, then, that it shouldn't be hard to do M transforms
of length N as a length M*N transform, at least for power of two M's.

Then add some work for putting together the longer transform
and taking apart the result.  It still might be faster than
separate transforms.

-- glen

Reply by rsdio ●January 29, 20112011-01-29

>> As I mentioned, and Kevin confirmed with his links, it is possible for
a
>> single mathematical transform to process 2 real channels (or to get 2
times
>> the bins from 1 real channel).  The core of my question is whether a
single
>> transform can deal with more than 2 (real) signals.
>
>There is a common trick to do a real transform of length 2N as
>a complex transform of length N.  You should also be able to do
>two real transforms of length N as one complex transform of length N.
>
>It seems to me, then, that it shouldn't be hard to do M transforms
>of length N as a length M*N transform, at least for power of two M's.
>
>Then add some work for putting together the longer transform
>and taking apart the result.  It still might be faster than
>separate transforms.
>
>-- glen

Glen, you've just repeated my original posting that started this
discussion.  It's the details of your last paragraph that I am searching
for.  I have a hunch that bit-reversed addressing can possibly handle the
splicing, but I have not worked it out in detail yet.  I am working with
powers of two, because in my specific application I have 8 channels of
independent data (at a common sample rate).

Now I understand why everyone maintains the entire thread when quoting on
comp.dsp - I'm using DSPRelated.com, which archives and collects together
all postings in a thread.  So, I did a bad job of attributing the replies
when quoting - partly because DSPRelated.com does not carry through the
attributions, and partly because I assumed most folks would have access to
the whole thread anyway.

So, Glen, I'm not complaining, but rather apologizing for my posting of a
confusing quotation.  I should probably dust off my copy of trn

Brian Willoughby
Sound Consulting

Previous12 3 Next

FFT tricks for optimized multichannel processing

Sign in

You might also like...

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About DSPRelated.com

Social Networks

The Related Media Group