# implementing convolution of real sequences using half length complex FFTs

Started by March 12, 2011
```Greetings,

I'm stuck trying to figure out how to use half length complex FFTs to
implement the convolution of two real sequences and could use some
advice as I feel like I might be missing something (relatively)
obvious.  I'm trying to convolve two real valued sequences (of length
N, zero padded to length 2N) using two complex-valued FFTs (of length
N).  I'm doing this in C using FFTW though I don't think its
particularly relevant.

The first thing I tried was the naive (and apparently incorrect)
approach of casting the real valued sequences to complex valued
sequences, taking the complex FFTs of those sequence, doing pointwise
multiplication of the (complex) FFT coefficients and then taking the
inverse FFT of the resulting set of coefficients results in an output
sequence in which half the samples are correct and half are
incorrect.

Doing a bit of digging I came across numerous pages describing how one
can use a single complex FFT to compute the FFTs of two real
sequences, and how one can combine those two FFTs into the FFT of a
single sequence of twice the length (ala the last step of a decimation
in time FFT implementation).  This works as expected but what I'm
trying to accomplish is a bit different.

What I'm trying to do is to partition the work associated with a
single long FFT across multiple processing frames (this is in a real
time audio processing context). For example, I'd like to compute a 256
point FFT by partitioning the work across 4 processing frames.
Without getting into too many details, I'm basically applying a radix
4 decimation in frequency to the input sequence to generate four
subsequences, taking the FFTs of those subsequences, doing pointwise
multiplication of the (complex) FFT coefficients of the input with the
corresponding FFT coefficients of a stored impulse response (that has
had the same DIF applied), taking the inverse FFT of the resulting
complex values and then constructing the output sequence by doing a
reverse decimation in frequency step.  Hope that made sense...

So when I do the naive thing and cast the real sequence to a complex
sequence and follow the above steps, the result is the same as
described above when using a single long FFT (no subdivision) - half
the outputs are correct (those that correspond to the imaginary
components of the complex sequence, that is the odd values of the real
sequence).  The thing I'm struggling with is how to descramble (or
whatever you want to call it) the FFT coefficients after the DIF step
has been applied in order to get correct output values.  The thing is
I need to be able to do this independently for each of the 4
subsequences resulting from the radix 4 DIF step (the FFT of the first
sequence gives every 4th FFT coefficient of the original sequence
(4k), the next sequence gives the (4k+1) coefficients, etc) but the
methods I've encountered that describe how to do the descrambling
require coefficients from distinct subsequences (i.e. to generate the
"unscrambled" version of the 2nd FFT coefficient of a 16 point FFT
requires the values of the 2nd and 16th "scrambled" FFT coefficients
which are not part of the same subsequence).

I'm hoping there's some way to do a "pre descrambling" in the time
domain before I convert things to the frequency domain or something
along those lines.  In any case, if anyone understands what I'm
talking about and has any insight I would be most grateful.  I've been
trying to wrap my head around this for a couple of days and am getting
a bit frustrated (I'm not really a DSP guy).    I'd be happy to
clarify anything if something doesn't make sense.

Thanks in advance for any help.

```
```On Mar 12, 4:05&#2013266080;pm, rimas <ri...@cnmat.berkeley.edu> wrote:
> Greetings,
>
> I'm stuck trying to figure out how to use half length complex FFTs to
> implement the convolution of two real sequences and could use some
> advice as I feel like I might be missing something (relatively)
> obvious. &#2013266080;I'm trying to convolve two real valued sequences (of length
> N, zero padded to length 2N) using two complex-valued FFTs (of length
> N). &#2013266080;I'm doing this in C using FFTW though I don't think its
> particularly relevant.
>
> The first thing I tried was the naive (and apparently incorrect)
> approach of casting the real valued sequences to complex valued
> sequences, taking the complex FFTs of those sequence, doing pointwise
> multiplication of the (complex) FFT coefficients and then taking the
> inverse FFT of the resulting set of coefficients results in an output
> sequence in which half the samples are correct and half are
> incorrect.

Well, it works if what you're trying to do is a circular convolution.
But what you probably want to do is a linear convolution, which is:
take the two N point real sequences, zero pad both to 2N points, take
the 2N FFT of each, multiply them, point by point and then 2N inverse
transform.

> Doing a bit of digging I came across numerous pages describing how one
> can use a single complex FFT to compute the FFTs of two real
> sequences, and how one can combine those two FFTs into the FFT of a
> single sequence of twice the length (ala the last step of a decimation
> in time FFT implementation). &#2013266080;This works as expected but what I'm
> trying to accomplish is a bit different.

In doing a linear convolution, as mentioned above, you could make use
of the 'Two N point real sequences with an N point complex FFT' by
zero padding your sequences to 2N, using one as the real input and the
other as the imaginary input to a 2N complex FFT.  After separating
the two transforms in the results, you'd proceed to the multiply/
inverse part.

The above can be modified to make use of the fact that half your
inputs to the complex 2N FFT are zero, but you might want to get
things working properly at first, rather than trying to optimize
something that doesn't work.

> What I'm trying to do is to partition the work associated with a
> single long FFT across multiple processing frames (this is in a real
> time audio processing context). For example, I'd like to compute a 256
> point FFT by partitioning the work across 4 processing frames.
> Without getting into too many details, I'm basically applying a radix
> 4 decimation in frequency to the input sequence to generate four
> subsequences, taking the FFTs of those subsequences, doing pointwise
> multiplication of the (complex) FFT coefficients of the input with the
> corresponding FFT coefficients of a stored impulse response (that has
> had the same DIF applied), taking the inverse FFT of the resulting
> complex values and then constructing the output sequence by doing a
> reverse decimation in frequency step. &#2013266080;Hope that made sense...
>
> So when I do the naive thing and cast the real sequence to a complex
> sequence and follow the above steps, the result is the same as
> described above when using a single long FFT (no subdivision) - half
> the outputs are correct (those that correspond to the imaginary
> components of the complex sequence, that is the odd values of the real
> sequence). &#2013266080;The thing I'm struggling with is how to descramble (or
> whatever you want to call it) the FFT coefficients after the DIF step
> has been applied in order to get correct output values. &#2013266080;The thing is
> I need to be able to do this independently for each of the 4
> subsequences resulting from the radix 4 DIF step (the FFT of the first
> sequence gives every 4th FFT coefficient of the original sequence
> (4k), the next sequence gives the (4k+1) coefficients, etc) but the
> methods I've encountered that describe how to do the descrambling
> require coefficients from distinct subsequences (i.e. to generate the
> "unscrambled" version of the 2nd FFT coefficient of a 16 point FFT
> requires the values of the 2nd and 16th "scrambled" FFT coefficients
> which are not part of the same subsequence).

If I understand you correctly, you're taking 4 consecutive points of
an N point sequence and applying a radix-4 DIF on the 4 points to get
4 sub-sequences, and doing FFTs on them.  If that is so, stop right
there.

I know of no conventional FFT with a radix-4 first stage that doesn't
require its inputs to be separated by N/4.  This is the case no matter
how you rearrange the graph(s), or what decimation or radix is used
subsequently.

There are in fact ways to split up an FFT, but not the way you're
doing it.

> I'm hoping there's some way to do a "pre descrambling" in the time
> domain before I convert things to the frequency domain or something
> along those lines. &#2013266080;In any case, if anyone understands what I'm
> talking about and has any insight I would be most grateful. &#2013266080;I've been
> trying to wrap my head around this for a couple of days and am getting
> a bit frustrated (I'm not really a DSP guy). &#2013266080; &#2013266080;I'd be happy to
> clarify anything if something doesn't make sense.
>
> Thanks in advance for any help.

Perhaps what you might be looking for is something called 'overlap-
add' or 'overlap-save' processing.  It allows you to section a long
input sequence into shorter sequences and do what is known as
'sectioned' convolutions.

What are the actual numbers you're dealing with?

Kevin McGee
```
```On Mar 13, 11:44&#2013266080;pm, kevin <kevinjmc...@netscape.net> wrote:
> On Mar 12, 4:05&#2013266080;pm, rimas <ri...@cnmat.berkeley.edu> wrote:
> > Greetings,
>
> > I'm stuck trying to figure out how to use half length complex FFTs to
> > implement the convolution of two real sequences and could use some
> > advice as I feel like I might be missing something (relatively)
> > obvious. &#2013266080;I'm trying to convolve two real valued sequences (of length
> > N, zero padded to length 2N) using two complex-valued FFTs (of length
> > N). &#2013266080;I'm doing this in C using FFTW though I don't think its
> > particularly relevant.
>
> > The first thing I tried was the naive (and apparently incorrect)
> > approach of casting the real valued sequences to complex valued
> > sequences, taking the complex FFTs of those sequence, doing pointwise
> > multiplication of the (complex) FFT coefficients and then taking the
> > inverse FFT of the resulting set of coefficients results in an output
> > sequence in which half the samples are correct and half are
> > incorrect.
>
> Well, it works if what you're trying to do is a circular convolution.
> But what you probably want to do is a linear convolution, which is:
> take the two N point real sequences, zero pad both to 2N points, take
> the 2N FFT of each, multiply them, point by point and then 2N inverse
> transform.
>

Yes, that's what I'm doing. At the moment I am using the overlap-save
method (so the N points of the input sequence aren't padded with N
zeros, they are combined with the N samples from the previous input
frame).

> > Doing a bit of digging I came across numerous pages describing how one
> > can use a single complex FFT to compute the FFTs of two real
> > sequences, and how one can combine those two FFTs into the FFT of a
> > single sequence of twice the length (ala the last step of a decimation
> > in time FFT implementation). &#2013266080;This works as expected but what I'm
> > trying to accomplish is a bit different.
>
> In doing a linear convolution, as mentioned above, you could make use
> of the 'Two N point real sequences with an N point complex FFT' by
> zero padding your sequences to 2N, using one as the real input and the
> other as the imaginary input to a 2N complex FFT. &#2013266080;After separating
> the two transforms in the results, you'd proceed to the multiply/
> inverse part.
>

Yeah that would work except for the fact that the impulse response I'm
convolving the input sequence with is static, so there's no need to
compute it's FFT more than once (so I compute it once and store it).

> The above can be modified to make use of the fact that half your
> inputs to the complex 2N FFT are zero, but you might want to get
> things working properly at first, rather than trying to optimize
> something that doesn't work.
>
> > What I'm trying to do is to partition the work associated with a
> > single long FFT across multiple processing frames (this is in a real
> > time audio processing context). For example, I'd like to compute a 256
> > point FFT by partitioning the work across 4 processing frames.
> > Without getting into too many details, I'm basically applying a radix
> > 4 decimation in frequency to the input sequence to generate four
> > subsequences, taking the FFTs of those subsequences, doing pointwise
> > multiplication of the (complex) FFT coefficients of the input with the
> > corresponding FFT coefficients of a stored impulse response (that has
> > had the same DIF applied), taking the inverse FFT of the resulting
> > complex values and then constructing the output sequence by doing a
> > reverse decimation in frequency step. &#2013266080;Hope that made sense...
>
> > So when I do the naive thing and cast the real sequence to a complex
> > sequence and follow the above steps, the result is the same as
> > described above when using a single long FFT (no subdivision) - half
> > the outputs are correct (those that correspond to the imaginary
> > components of the complex sequence, that is the odd values of the real
> > sequence). &#2013266080;The thing I'm struggling with is how to descramble (or
> > whatever you want to call it) the FFT coefficients after the DIF step
> > has been applied in order to get correct output values. &#2013266080;The thing is
> > I need to be able to do this independently for each of the 4
> > subsequences resulting from the radix 4 DIF step (the FFT of the first
> > sequence gives every 4th FFT coefficient of the original sequence
> > (4k), the next sequence gives the (4k+1) coefficients, etc) but the
> > methods I've encountered that describe how to do the descrambling
> > require coefficients from distinct subsequences (i.e. to generate the
> > "unscrambled" version of the 2nd FFT coefficient of a 16 point FFT
> > requires the values of the 2nd and 16th "scrambled" FFT coefficients
> > which are not part of the same subsequence).
>
> If I understand you correctly, you're taking 4 consecutive points of
> an N point sequence and applying a radix-4 DIF on the 4 points to get
> 4 sub-sequences, and doing FFTs on them. &#2013266080;If that is so, stop right
> there.
>
> I know of no conventional FFT with a radix-4 first stage that doesn't
> require its inputs to be separated by N/4. &#2013266080;This is the case no matter
> how you rearrange the graph(s), or what decimation or radix is used
> subsequently.
>
> There are in fact ways to split up an FFT, but not the way you're
> doing it.
>

Sorry if I wasn't clear - I am in fact using a radix 4 DIF stage to
convert one N point (complex) sequence into 4 N/4 point complex
sequences, where the FFT coefficients of each of the subsequences are
a subset of the FFT coefficients of the input sequence (the FFT of
sequence one has FFT coefficients with indices 0,4,8,12 of the
original sequence, sequence two 1,5,9,13, etc).

> > I'm hoping there's some way to do a "pre descrambling" in the time
> > domain before I convert things to the frequency domain or something
> > along those lines. &#2013266080;In any case, if anyone understands what I'm
> > talking about and has any insight I would be most grateful. &#2013266080;I've been
> > trying to wrap my head around this for a couple of days and am getting
> > a bit frustrated (I'm not really a DSP guy). &#2013266080; &#2013266080;I'd be happy to
> > clarify anything if something doesn't make sense.
>
> > Thanks in advance for any help.
>
> Perhaps what you might be looking for is something called 'overlap-
> add' or 'overlap-save' processing. &#2013266080;It allows you to section a long
> input sequence into shorter sequences and do what is known as
> 'sectioned' convolutions.
>

Yes, that is in fact what I'm doing.  I am using overlap-save now but
samples in each input vector be 0 would simplify things.  I am doing
partitioned convolution with two different partition sizes - at the
moment the primary partition size is 32 samples and the secondary
partition size is 256 samples.

> What are the actual numbers you're dealing with?
>

> Kevin McGee

```
```On Mar 14, 1:04&#2013266080;pm, rimas <ri...@cnmat.berkeley.edu> wrote:

[ ...]

> Sorry if I wasn't clear - I am in fact using a radix 4 DIF stage to
> convert one N point (complex) sequence into 4 N/4 point complex
> sequences, where the FFT coefficients of each of the subsequences are
> a subset of the FFT coefficients of the input sequence (the FFT of
> sequence one has FFT coefficients with indices 0,4,8,12 of the
> original sequence, sequence two 1,5,9,13, etc).

OK.  That could actually work IF you keep the twiddling and bit-
reversal relationships straight.

For example, Fig. 6.14 on p. 380 of Rabiner and Gold's 1975 DSP book
shows a radix-2 32 point DIF graph.  The first radix-2 stage has
twiddles of 0 to 15, and the second has 2 groups of even twiddles from
0 to 14 (i.e.: 0, 2, 4, &#2013266053; 14).  The subsequent 3 stages are just four
groups of 8 point FFTs.

Now, if you replace those first 2 radix-2 stages with a 1 stage
FFT (whether DIT or DIF) will only have internal twiddles of 0
(multiply by 1) and 1 (multiply by -j).

There is a way out of this mess by adding the appropriate inter-stage
twiddles after the first radix-4 stage.

you're changing the overall graph into a two stage algorithm, and
you've created a bit-reversal problem as well.

I know I'm not being terribly clear about things - without some graphs
to look at, it's tough to see.  But suffice to say that, unless you
know exactly what you're doing, it's inadvisable to mess around with
the internals of your FFT algorithm.

[..sectioned convolution]
> Yes, that is in fact what I'm doing. &#2013266080;I am using overlap-save now but
> samples in each input vector be 0 would simplify things. &#2013266080;I am doing
> partitioned convolution with two different partition sizes - at the
> moment the primary partition size is 32 samples and the secondary
> partition size is 256 samples.

As stated before, I wouldn't be overly concerned with efficiency right
now.  I think it's more important to get something working correctly
at first.  That way, you get a 'known good' routine that you can use
as baseline for comparison when doing subsequent modifications to
improve efficiency.

Kevin McGee
```
```On Mar 15, 12:01&#2013266080;am, kevin <kevinjmc...@netscape.net> wrote:
> On Mar 14, 1:04&#2013266080;pm, rimas <ri...@cnmat.berkeley.edu> wrote:
>
> [ ...]
>
> > Sorry if I wasn't clear - I am in fact using a radix 4 DIF stage to
> > convert one N point (complex) sequence into 4 N/4 point complex
> > sequences, where the FFT coefficients of each of the subsequences are
> > a subset of the FFT coefficients of the input sequence (the FFT of
> > sequence one has FFT coefficients with indices 0,4,8,12 of the
> > original sequence, sequence two 1,5,9,13, etc).
>
> OK. &#2013266080;That could actually work IF you keep the twiddling and bit-
> reversal relationships straight.
>
> For example, Fig. 6.14 on p. 380 of Rabiner and Gold's 1975 DSP book
> shows a radix-2 32 point DIF graph. &#2013266080;The first radix-2 stage has
> twiddles of 0 to 15, and the second has 2 groups of even twiddles from
> 0 to 14 (i.e.: 0, 2, 4, &#2013266053; 14). &#2013266080;The subsequent 3 stages are just four
> groups of 8 point FFTs.
>
> Now, if you replace those first 2 radix-2 stages with a 1 stage
> radix-4 DIF, you've messed up the twiddling. &#2013266080;That's because a radix-4
> FFT (whether DIT or DIF) will only have internal twiddles of 0
> (multiply by 1) and 1 (multiply by -j).
>
> There is a way out of this mess by adding the appropriate inter-stage
> twiddles after the first radix-4 stage.
>
> Basically, by replacing the first two radix-2 stages with a radix-4,
> you're changing the overall graph into a two stage algorithm, and
> you've created a bit-reversal problem as well.
>
> I know I'm not being terribly clear about things - without some graphs
> to look at, it's tough to see. &#2013266080;But suffice to say that, unless you
> know exactly what you're doing, it's inadvisable to mess around with
> the internals of your FFT algorithm.
>
> [..sectioned convolution]
>
> > Yes, that is in fact what I'm doing. &#2013266080;I am using overlap-save now but
> > could conceivably use overlap-add instead if having half of the
> > samples in each input vector be 0 would simplify things. &#2013266080;I am doing
> > partitioned convolution with two different partition sizes - at the
> > moment the primary partition size is 32 samples and the secondary
> > partition size is 256 samples.
>
> As stated before, I wouldn't be overly concerned with efficiency right
> now. &#2013266080;I think it's more important to get something working correctly
> at first. &#2013266080;That way, you get a 'known good' routine that you can use
> as baseline for comparison when doing subsequent modifications to
> improve efficiency.
>
> Kevin McGee

Thanks again for the reply.  I do in fact have a working version of
this algorithm where I represent everything (including the real input
and impulse response time domain sequences) as complex values but by
my reckoning it's roughly 2x slower than it ought to be if I was
taking advantage of the symmetries and such associated with FFTs of
real sequences.  I'll take a look at the diagram you referenced,
thanks for the pointer.

```