# cross correlation and FFT problem

Started by July 4, 2005
```Hi my friends,
I don't find solution for this, i give you the example
i have two signals
x=[1 0 1 1 2 3 4 4 1 0 1 0 1];
y=[1 1 1 1 1 1 2 3 4 4 1 0 1 0 1];
I calculated cross correlation with xcorr in matlab
z=xcorr(x,y), length(z)=2*max(length(x),length(y))-1
[m,I]=max(z)
m=51, I=17
i'm trying to compare this to rxy
L=max(length(x),length(y))
NFFT=2*L
xx=fft(x,NFFT)
yy=fft(y,NFFT)
Sxy=xx.*conj(yy)
rxy=fftshift(real(ifft(Sxy)));
[k,J]=max(rxy)
k=51
J=14
my question is why I is different to J
may be the NFFT is not correct, can you help me please
thanks a lot

This message was sent using the Comp.DSP web interface on
www.DSPRelated.com
```
```Hi,

Be careful about the length of the FFT; you need a length at least the
sum of the lenghts of the two sequences minus one. The reason is that
the frequency domain operation is really a circular convolution.

If you end up requiring a weird FFT length you may want to take a look
at my FFT routine available from my homepage.

Best regards,
Jens J. Nielsen
http:/home.get2net.dk/jjn

On Mon, 04 Jul 2005 15:10:32 -0500, "signal" <monirov@hotmail.com>
wrote:

>Hi my friends,
>I don't find solution for this, i give you the example
>i have two signals
>x=[1 0 1 1 2 3 4 4 1 0 1 0 1];
>y=[1 1 1 1 1 1 2 3 4 4 1 0 1 0 1];
>I calculated cross correlation with xcorr in matlab
>z=xcorr(x,y), length(z)=2*max(length(x),length(y))-1
>[m,I]=max(z)
>m=51, I=17
>i'm trying to compare this to rxy
>L=max(length(x),length(y))
>NFFT=2*L
>xx=fft(x,NFFT)
>yy=fft(y,NFFT)
>Sxy=xx.*conj(yy)
>rxy=fftshift(real(ifft(Sxy)));
>[k,J]=max(rxy)
>k=51
>J=14
>my question is why I is different to J
>may be the NFFT is not correct, can you help me please
>thanks a lot
>
>
>
>
>
>This message was sent using the Comp.DSP web interface on
>www.DSPRelated.com

```
```
Jens J&#2013266168;rgen Nielsen wrote:
> Hi,
>
> Be careful about the length of the FFT; you need a length at least the
> sum of the lenghts of the two sequences minus one. The reason is that
> the frequency domain operation is really a circular convolution.

A rather useless operation IMHO.  :-)

Bob
--

"Things should be described as simply as possible, but no
simpler."

A. Einstein
```
```Is there any good examples (like commercial products or standards)
using the frequency domain for efficient convolution/filtering?

I guess that's what Bob's comment is about!?

Jens

I tend to agree with Bob's comment about doing convolution in the
frequency domain. Even though it's theoretically more efficient than
the time domain operation in certain cases I'm not familiar with any
practical implementations

On Wed, 28 Sep 2005 21:59:13 -0700, Bob Cain
<arcane@arcanemethods.com> wrote:

>
>
>Jens J&#2013266168;rgen Nielsen wrote:
>> Hi,
>>
>> Be careful about the length of the FFT; you need a length at least the
>> sum of the lenghts of the two sequences minus one. The reason is that
>> the frequency domain operation is really a circular convolution.
>
>A rather useless operation IMHO.  :-)
>
>
>Bob

```
```I disagree with the conclusion.  Jens, at least you are more diplomatic
in your opinions.  Bob, you shouldn't claim something is useless, just
because you can't find a use for it.

I can see how you'd form your opinion *if* you've mainly worked in the
fixed-point world.

I assume everyone here knows that fast convolution filtering becomes
increasingly more efficient for longer filter lengths.  Below some
crossover point, direct convolution is more efficient. Above that point,
fast convolution is quicker. I will not spend time talking about this
point, since I doubt it is contested.

For a fixed point DSP processor, fast convolution has a couple of
strikes against it:
1) lack of dynamic range in a fixed-point FFT. This prohibits large FFTs
required for large filters
2) DSP chips are already optimized for direct FIR compution/correlation
Most (all) DSP chips have a single cycle multiply-accumulate
operation.  This moves the crossover point for direct vs. fast
convolution much higher than the theoretical 25-30 taps.
3) Such systems are usually memory-poor.
4) For the applications built on fixed-point DSPs, latency requirements
are just as important than throughput requirements.

These factors act to squeeze down the useful range of fast convolution.
The filters cannot be too big, because the FFT turns to mush.  For a
small filter, you'd be better off using direct computation. ON A FIXED
POINT PLATFORM.!!!!

On the other hand:
For floating point on a desktop/server processor, fast convolution is my
filtering tool of choice.  Unless you are using REALLY short filters,
you'd probably be better off with fast convolution.

1) The dynamic range of a decent floating point FFT is not really an issue.
2) Efficient SIMD FFT libraries such as FFTW or IPP make the crossover
point for direct vs fast filtering much lower than the theoretical 25-30
taps, more like 16 taps.
3) The extra memory required for fast convolution is negligible on a
desktop/server.
4) desktop/server DSP applications are often file-based. For these,
latency is a non-issue.

-- Mark

Jens J&#2013266168;rgen Nielsen wrote:
> Is there any good examples (like commercial products or standards)
> using the frequency domain for efficient convolution/filtering?
>
> I guess that's what Bob's comment is about!?
>
> Jens
>
> I tend to agree with Bob's comment about doing convolution in the
> frequency domain. Even though it's theoretically more efficient than
> the time domain operation in certain cases I'm not familiar with any
> practical implementations
>
> On Wed, 28 Sep 2005 21:59:13 -0700, Bob Cain
> <arcane@arcanemethods.com> wrote:
>

[]
>>>the frequency domain operation is really a circular convolution.
>>
>>A rather useless operation IMHO.  :-)
>>
>>
>>Bob
>
>
```
```
Jens J&#2013266168;rgen Nielsen wrote:
> Is there any good examples (like commercial products or standards)
> using the frequency domain for efficient convolution/filtering?
>
> I guess that's what Bob's comment is about!?
>
> Jens
>
> I tend to agree with Bob's comment about doing convolution in the
> frequency domain. Even though it's theoretically more efficient than
> the time domain operation in certain cases I'm not familiar with any
> practical implementations

Oops!  My comment was with respect to doing non-extended,
circular convolution.  Doing convolution in the frequency
domain with overlap add/save is _very_ efficient for long
kernels compared to doing it in the time domain.

Bob
--

"Things should be described as simply as possible, but no
simpler."

A. Einstein
```
```
Mark Borgerding wrote:
> I disagree with the conclusion.  Jens, at least you are more diplomatic
> in your opinions.  Bob, you shouldn't claim something is useless, just
> because you can't find a use for it.

Sorry my terse comment was ambiguous.  See my response to Jens.

Bob
--

"Things should be described as simply as possible, but no
simpler."

A. Einstein
```
```Mark Borgerding wrote:
> I disagree with the conclusion.  Jens, at least you are more diplomatic
> in your opinions.  Bob, you shouldn't claim something is useless, just
> because you can't find a use for it.
>
> I can see how you'd form your opinion *if* you've mainly worked in the
> fixed-point world.
>
> I assume everyone here knows that fast convolution filtering becomes
> increasingly more efficient for longer filter lengths.  Below some
> crossover point, direct convolution is more efficient. Above that point,
> fast convolution is quicker. I will not spend time talking about this
> point, since I doubt it is contested.
>
>
> For a fixed point DSP processor, fast convolution has a couple of
> strikes against it:
> 1) lack of dynamic range in a fixed-point FFT. This prohibits large FFTs
> required for large filters
> 2) DSP chips are already optimized for direct FIR compution/correlation
>    Most (all) DSP chips have a single cycle multiply-accumulate
> operation.  This moves the crossover point for direct vs. fast
> convolution much higher than the theoretical 25-30 taps.
> 3) Such systems are usually memory-poor.
> 4) For the applications built on fixed-point DSPs, latency requirements
> are just as important than throughput requirements.
>
> These factors act to squeeze down the useful range of fast convolution.
>   The filters cannot be too big, because the FFT turns to mush.  For a
> small filter, you'd be better off using direct computation. ON A FIXED
> POINT PLATFORM.!!!!
>
> On the other hand:
> For floating point on a desktop/server processor, fast convolution is my
> filtering tool of choice.  Unless you are using REALLY short filters,
> you'd probably be better off with fast convolution.
>
> 1) The dynamic range of a decent floating point FFT is not really an issue.
> 2) Efficient SIMD FFT libraries such as FFTW or IPP make the crossover
> point for direct vs fast filtering much lower than the theoretical 25-30
> taps, more like 16 taps.
> 3) The extra memory required for fast convolution is negligible on a
> desktop/server.
> 4) desktop/server DSP applications are often file-based. For these,
> latency is a non-issue.
>
>
> -- Mark
>
>

Thanks for a thoughtful posting, Mark. I would add one more item to

5) Optimized FFT libraries for fixed-point DSP chips are usually
limited to powers of two up to 1k or so.

I would also point out that modern fixed-point DSP chips can directly
convolve in less than one clock per cycle, either by using specialized
instructions for symmetric coefficients or parallel MAC units.

If there is a rate change in the FIR, the crossover point moves yet
again because the linearly reduced (increased) MAC load has to be
compared to the reduced (increased) IFFT size.

John

```
```john wrote:

> Thanks for a thoughtful posting, Mark. I would add one more item to
> your strike list:
>
> 5) Optimized FFT libraries for fixed-point DSP chips are usually
> limited to powers of two up to 1k or so.

The power-of-2 limitation is not really an issue for fast convolution.
But the overall size limitation certainly sets an upper bound on filter
lengths.

>
> I would also point out that modern fixed-point DSP chips can directly
> convolve in less than one clock per cycle, either by using specialized
> instructions for symmetric coefficients or parallel MAC units.

Good point.  I should've said, "single cycle multiply-accumulate
*or* *better*"

>
> If there is a rate change in the FIR, the crossover point moves yet
> again because the linearly reduced (increased) MAC load has to be
> compared to the reduced (increased) IFFT size.

Interesting.

This effect would be observed regardless of floating vs. fixed point.

It can be mitigated by the fact that the forward & inverse FFTs don't
need to be the same size.  For details, keep an eye out for my article
in next May's issue of IEEE Signal Processing Magazine. </PLUG>

If I am thinking clearly, the direction that the crossover point moves
would depend on the direction of the rate change (upsampling vs.
downsampling).  Upsampling should favor fast convolution while
downsampling should favor direct conv.

Let me try to spell out my thoughts.

The following compares the cpu cost(work) involved non-rate-changing
filtering vs.
1-stage (not-multirate) interpolator or decimator
(complete with anti-aliasing)
with either direct or fast convolution.

With direct convolution:
decimation by D
anti-aliasing filter is designed at higher (input) rate
scales the processor work by 1/D  (computed for each output sample)
interpolation by I
anti-aliasing filter is designed at higher (output) rate
scales the processor work by I (again, for each output sample)

With fast convolution
decimation by D
anti-aliasing filter is designed at higher (input) rate
forward FFT same size (as no r.c.), inverse FFT is 1/D size
scales work by roughly .5 + .5/D
i.e. you can *never* get rid of the first half of the work.
interpolation by I
anti-aliasing filter only needs to be designed at *lower* (input) rate
fwd FFT same size,  inv FFT I times bigger (zero padded).
scales work by roughly .5 + .5*I

>
> John
>
Good to hear from you! Drop me a line. I only have your work email.

-- Mark Borgerding
```
```Mark Borgerding wrote:
> john wrote:
>
> > Thanks for a thoughtful posting, Mark. I would add one more item to
> > your strike list:
> >
> > 5) Optimized FFT libraries for fixed-point DSP chips are usually
> > limited to powers of two up to 1k or so.
>
> The power-of-2 limitation is not really an issue for fast convolution.
> But the overall size limitation certainly sets an upper bound on filter
> lengths.

For a non-power-of-2 rate change, the IFFT might not be availabe
though.

>
> >
> > I would also point out that modern fixed-point DSP chips can directly
> > convolve in less than one clock per cycle, either by using specialized
> > instructions for symmetric coefficients or parallel MAC units.
>
>
> Good point.  I should've said, "single cycle multiply-accumulate
> *or* *better*"
>
> >
> > If there is a rate change in the FIR, the crossover point moves yet
> > again because the linearly reduced (increased) MAC load has to be
> > compared to the reduced (increased) IFFT size.
>
> Interesting.
>
> This effect would be observed regardless of floating vs. fixed point.
>
> It can be mitigated by the fact that the forward & inverse FFTs don't
> need to be the same size.  For details, keep an eye out for my article
> in next May's issue of IEEE Signal Processing Magazine. </PLUG>
>
> If I am thinking clearly, the direction that the crossover point moves
> would depend on the direction of the rate change (upsampling vs.
> downsampling).  Upsampling should favor fast convolution while
> downsampling should favor direct conv.
>
> Let me try to spell out my thoughts.
>
> The following compares the cpu cost(work) involved non-rate-changing
> filtering vs.
>    1-stage (not-multirate) interpolator or decimator
>   (complete with anti-aliasing)
>   with either direct or fast convolution.
>
> With direct convolution:
>    decimation by D
>      anti-aliasing filter is designed at higher (input) rate
>      scales the processor work by 1/D  (computed for each output sample)
>    interpolation by I
>      anti-aliasing filter is designed at higher (output) rate
>      scales the processor work by I (again, for each output sample)
>
> With fast convolution
>    decimation by D
>      anti-aliasing filter is designed at higher (input) rate
>      forward FFT same size (as no r.c.), inverse FFT is 1/D size
>       scales work by roughly .5 + .5/D
> 	i.e. you can *never* get rid of the first half of the work.
>    interpolation by I
>      anti-aliasing filter only needs to be designed at *lower* (input) rate
>      fwd FFT same size,  inv FFT I times bigger (zero padded).
>      scales work by roughly .5 + .5*I
>
> >
> > John
> >
> Good to hear from you! Drop me a line. I only have your work email.
>
>
> -- Mark Borgerding

The work scales down in both the D and I cases (I misspoke on this
point). In the former case every Dth input is clocked into a
full-length MAC loop. In the latter case all inputs are clocked into a
MAC loop that is reduced by I by eliminating the 0*tap multiplies
(using the appropriate tap subset each time).

As you know, the convolution method can be attractive where a
fractional rate change is desired.

John

```