FFT convolution with overlap add

Started by Uli December 13, 2004
Hi there,

I need some advice for a convolution problem:

I have to compute several thousand convolutions. I managed to do it
with the overlap-add method described in chapter 18 of the dspguide,
but it takes too much time to compute.

The method described breaks a signal with a large number of points
into small segments, which are then convolved with a small filter.

In my case I have to convolve 16384 and 65536 points. So I did it by
breaking the 65536 points into 4 segments and used a 32768-point FFT.

It seems I need to segment both inputs of the convolution. How do I do
that? Is there a formula or an example program?

Regards

Uli

Uli wrote:
> [snip]
I guess you have already tried frequency-domain FFT-based convolution,
which can be faster than time-domain linear convolution. The problem I
see here is input-output latency (delay), which adds up, and buffering
is also required a priori.

A better method has been suggested by Gardner, sometimes known as the
zero-delay algorithm. See:

W. G. Gardner, "Efficient Convolution without Input-Output Delay",
J. Audio Eng. Soc., vol. 43, no. 3, pp. 127-136, March 1995.

Hope that helps a little.

santosh
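For concreteness, here is a minimal sketch of the frequency-domain convolution described above, at the sizes from the original post (an illustration assuming numpy, not code from this thread):

import numpy as np

x = np.random.rand(65536)             # placeholder for the long input
h = np.random.rand(16384)             # placeholder for the filter

n = len(x) + len(h) - 1               # linear convolution length, 81919
nfft = 1 << (n - 1).bit_length()      # next power of two, here 131072

# multiply the zero-padded spectra, then transform back
y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)[:n]

# sanity check on small sizes: np.allclose(y, np.convolve(x, h)) should hold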
On 2004-12-14 06:34:17 +0100, "Nilnod" <santosh.nath@ntlworld.com> said:

> A better method has been suggested by Gardner, sometimes known as
> the zero-delay algorithm. See:
I think there's a patent on that by Lake DSP, IIRC (even though I
sometimes wonder why no one has patented multiplication and addition
yet, if it's possible to patent basic stuff like that). You might wish
to look it up, too.

-- Stephan M. Bernsee http://www.dspdimension.com
Uli wrote:
> [snip]
I am not sure I understand what the problem is. Is this
16384-by-65536-point convolution one of the several thousand operations
you need to do? If so, how are the datasets interconnected? What are
the time constraints on your computations? Memory? What computer
architecture do you use?

On my 1.4 GHz desktop PC with 1 GB RAM, the 65536-point FFT (no
overlap-add) takes some 0.3 seconds under Matlab 7.0. Is that very
different from what your system comes up with?

Let me speculate a bit: if you have to compute several thousand
65536-by-16384-point convolutions, it will take a lot of time on a
single-CPU computer. With the numbers cited above, 1000 65536-point
FFTs will take 30 seconds. You need two FFTs and one IFFT per
convolution operation: 90 seconds. Moving data back and forth between
disk and memory comes in addition to that, say 15 seconds per 1000
data sequences. So we are talking on the order of 1-2 minutes of CPU
time per 1000 convolutions. If you have 10000, 20000 or 100000 such
operations, you'll get very bored waiting for the thing to run through.

If my speculations above about how you work are correct, the only
thing that might speed up your program (apart from a faster CPU) is
parallel processing on a multi-CPU system. This structure is typical
for seismic processing: each sub-problem is relatively small, but the
sheer number of problems makes the computational task overwhelming.
What saves the day is that the problems are decoupled and can be
handled in parallel. If the data have to be treated in sequence...

Rune
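Rune's per-convolution cost model (two forward FFTs plus one inverse FFT) is easy to check on a given machine with a small timing loop. A sketch, assuming numpy; the numbers will of course differ from the 2004-era figures above:

import numpy as np, time

nfft = 131072                     # covers the 65536 + 16384 - 1 output points
x = np.random.rand(65536)
h = np.random.rand(16384)

reps = 20
t0 = time.time()
for _ in range(reps):
    y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)
dt = (time.time() - t0) / reps
print("%.4f s per convolution, %.1f s per 1000" % (dt, dt * 1000))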
Stephan M. Bernsee wrote:
> [snip]
> I think there's a patent on that by Lake DSP, iirc (even though I
> sometimes wonder why no one has patented multiplication and addition
> yet if it's possible to patent basic stuff like that).
IIRC, in the early '70s someone attempted to patent using color on a computer display ;/
Hi Rune,

yes, I need 32768 convolutions of 16384 by 65536 points. Because I'm
not an expert, I used the algorithms from the dspguide, and a
calculation like this takes days on my 1 GHz laptop. So I studied
further and reduced the FFT size from 131072 points down to 32768 by
segmenting the 65536 points into four portions and using FFT with
overlap-add. Now the computation with the smaller FFT size takes
10 hours.

So my idea was to reduce the FFT size further. But the bigger input
has the main influence, so I think that splitting e.g. the 16384
points into 8 x 2048 and the 65536 points into 32 x 2048 and using a
4096-point FFT should reduce the overall calculation time. Because it
is a static calculation, latency or delay is not the problem, only the
total time.
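Segmenting both inputs amounts to a simple identity: split x into M blocks and h into K blocks of a common length B, and each pairwise block convolution lands at offset (m + k)*B in the output. A toy-sized numpy sketch of that formula (an illustration, not Uli's code):

import numpy as np

B = 64                                # block length (2048 in the post above)
x = np.random.rand(8 * B)             # stands in for the 16384-point input
h = np.random.rand(32 * B)            # stands in for the 65536-point input

y = np.zeros(len(x) + len(h) - 1)
for m in range(len(x) // B):
    for k in range(len(h) // B):
        xb = x[m * B:(m + 1) * B]
        hb = h[k * B:(k + 1) * B]
        # each block pair contributes a (2B - 1)-point partial result
        y[(m + k) * B:(m + k) * B + 2 * B - 1] += np.convolve(xb, hb)

print(np.allclose(y, np.convolve(x, h)))   # True

Note, though, that the number of block pairs grows as M x K, so a smaller FFT does not by itself mean less total work.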

A parallel computer is not possible; I have only a single-CPU PC. I
also do not know how to code it in the right way.

Uli

Uli wrote:
> [snip]
What FFT library do you use? As Rune said, this could take long, but
not days unless you have a very inefficient FFT. For fun I tried to
hack this problem up in Numerical Python (script below). I have a
750 MHz Duron, so almost anything should be able to do this faster. On
my machine each convolution takes a little more than half a second, so
the estimated time for 32768 convolutions would be 4 hrs. You are not
trying to hold it all in RAM at once, are you? In that case you will
swap a lot, and that will slow the machine to a halt.

-- Brian

Python script:

from Numeric import *
from FFT import *
from random import random
from time import time

la = 65536
lb = 16384
lt = la + lb

a = array([0.] * lt)
b = array([0.] * lt)

# Do this 32768 times
for k in range(32768):
    # Form two random sets to convolve
    s1 = array([random() for x in range(la)])
    s2 = array([random() for x in range(lb)])
    a[:la] = s1
    b[:lb] = s2
    t = time()
    r = inverse_fft(fft(a) * fft(b))
    print "Time: %f" % (time() - t)
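One easy win in the script above: the 65536-point operand is the same in every pass, so its transform can be hoisted out of the loop (Rune makes the same point below). A rough equivalent in modern numpy, as an illustration:

import numpy as np

la, lb = 65536, 16384
nfft = la + lb                    # >= la + lb - 1, same padding as the script
fixed = np.random.rand(la)        # placeholder for the fixed 65536-point data
F = np.fft.rfft(fixed, nfft)      # computed once, outside the loop

for _ in range(4):                # 32768 in the real run
    sig = np.random.rand(lb)      # placeholder for one 16384-point input
    y = np.fft.irfft(F * np.fft.rfft(sig, nfft), nfft)

This costs one forward and one inverse FFT per convolution instead of two forward plus one inverse.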
Brian Dam Pedersen wrote:

> [snip]
Uli, I agree that the FFT you are using sounds very inefficient. You
could check the method of accessing the FFT 'twiddle factors'. The
twiddle factors (phase rotations) should be calculated once and stored
in a table. The FFT butterflies should access the table, and not waste
time calling cos and sin functions each time a twiddle factor is
needed.

Regards,

John
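To make John's point concrete, here is a sketch of an iterative radix-2 FFT in which every twiddle factor comes from a table built once up front, so no sin/cos is evaluated inside the butterflies (an illustration in Python, not the dspguide routine):

import numpy as np

def fft_with_table(x):
    # iterative radix-2 Cooley-Tukey FFT; twiddle factors are read from
    # a table computed once, not via sin/cos calls in the butterflies
    x = np.array(x, dtype=complex)
    n = len(x)                                       # must be a power of two
    W = np.exp(-2j * np.pi * np.arange(n // 2) / n)  # twiddle table

    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]

    # butterfly passes
    size = 2
    while size <= n:
        step = n // size                 # stride into the twiddle table
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = W[k * step]          # table lookup, no sin/cos call here
                a = x[start + k]
                b = x[start + k + half] * w
                x[start + k] = a + b
                x[start + k + half] = a - b
        size *= 2
    return x

# quick check: v = np.random.rand(1024); np.allclose(fft_with_table(v), np.fft.fft(v))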
Uli wrote:
> [snip]
You haven't really explained what problem you are trying to solve. Now
that you tell us that you experiment with different FFT lengths, I
don't see off the top of my head what kind of problem you are working
with. Could you provide some details about what type of signal you
work with, and what you try to achieve?

If you try to filter one very long data sequence with a fixed filter,
you compute the DFT of the filter once and store it. That way, you
save some 30% run-time.

Anyway, from your stated numbers, I get that you have 32768 * 65536 =
2.1 billion samples. Assume these are stored in single-precision
floating-point format, and you have over 8 GB of data to handle. That
doesn't fit into the memory of your computer, so with the naive
approach you just get stuck swapping data. From what you say about
run-time, it sounds like this is what happens to you. So regardless of
what you end up doing as far as the filtering is concerned, you HAVE
to think seriously about data flow issues.

What you need to do is find out how much RAM your computer has. I
don't know how much RAM the OS needs to run, but assume you have
512 MB RAM; then assign not much more than 256 MB for the data:
128 MB as an input buffer and 128 MB as an output buffer, some space
for internal variables, and still enough room for the OS to do its job
without starting to swap. These are conservative estimates, so you
might be able to squeeze the numbers a bit. Keep in mind that the
objective of the whole exercise is to minimize the number of accesses
to the disk, both on input and on output, while staying inside the
available RAM.

You don't say what computational environment you use. My estimates are
based on using a reasonably efficient C/C++ or FORTRAN implementation.
All bets as far as run-time is concerned are off if you use Matlab.
Matlab is notoriously inefficient for heavy-duty stuff like what you
have here. To demonstrate, I have this test:

% ================= Script looptest.m ============================
N=10000000;   % Reduce if memory runs out
V=0:N-1;

sum1=0;
sw1=cputime;
for n=1:N
    sum1=sum1+V(n);
end
cputime1=cputime-sw1;

sw2=cputime;
sum2=sum(V);
cputime2=cputime-sw2;

disp(['Run-time, computing the sum of ',int2str(N),' elements:']);
disp(['For-loop          : ',num2str(cputime1),' seconds']);
disp(['Built-in function : ',num2str(cputime2),' seconds']);
% ==== End looptest.m ============================================

The output on my computer is

Run-time, computing the sum of 10000000 elements:
For-loop          : 36.437 seconds
Built-in function : 0.203 seconds

So there is an absurd overhead associated with loops in Matlab.

Rune
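A sketch of the buffered data flow Rune describes, using numpy; the file names, shapes, and batch size below are made up for illustration:

import numpy as np

la, lb = 65536, 16384
nfft = 131072
batch = 256                       # tune so batch * nfft * 8 bytes fits in RAM

# the fixed signal's DFT is computed once (hypothetical input file)
fixed = np.fromfile("fixed_signal.f64", dtype=np.float64)
F = np.fft.rfft(fixed, nfft)

# stream the 32768 inputs through RAM in batches instead of loading all 8 GB
inputs = np.memmap("inputs.f64", dtype=np.float64, mode="r",
                   shape=(32768, lb))
out = np.memmap("results.f64", dtype=np.float64, mode="w+",
                shape=(32768, la + lb - 1))

for i in range(0, 32768, batch):
    blk = np.asarray(inputs[i:i + batch])      # read one batch into RAM
    Y = np.fft.rfft(blk, nfft, axis=1) * F     # broadcast the fixed DFT
    out[i:i + batch] = np.fft.irfft(Y, nfft, axis=1)[:, :la + lb - 1]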
Hi all,

Thanks first for all your comments. As I honestly said, I'm a newbie
to DSP; the book at www.dspguide.com is what I have understood so far,
and I have used the code samples there directly. They are given in
BASIC and I have translated them to Delphi, my favourite programming
language. So I have declared arrays of the described sizes with double
precision.


Some part of the convolution program:

procedure convolution(nr: integer);
var
  i, k: integer;
  a, b, c, d: double;
begin
  { clear the overlap buffer }
  for i := 0 to high(olap) do
    olap[i] := 0;

  { process the four 16384-point segments of the 65536-point signal }
  for k := 0 to 3 do
  begin
    { complex multiply in the frequency domain: precomputed segment
      spectrum times test-signal spectrum, bins 0..16384 of the
      32768-point real FFT }
    for i := 0 to 16384 do
    begin
      a := resignal[nr][k][i];
      b := imsignal[nr][k][i];
      c := refr[i];
      d := imfr[i];
      fft_rex[i] := a*c - b*d;
      fft_imx[i] := a*d + b*c;
    end;
    { bins above 16384 are not used; clear them }
    for i := 16385 to high(fft_rex) do
    begin
      fft_rex[i] := 0;
      fft_imx[i] := 0;
    end;

    fft_inversetransform;

    { overlap-add: fold in the saved tail of the previous segment,
      save this segment's tail, emit 16384 finished output points }
    for i := 0 to high(olap) do
      fft_rex[i] := fft_rex[i] + olap[i];
    for i := 16384 to high(fft_rex) do
      olap[i-16384] := fft_rex[i];
    for i := 0 to 16383 do
      y[i + k*16384] := fft_rex[i];
  end;

  { flush the final overlap tail }
  for i := 0 to high(olap) do
    y[i + 65536] := olap[i];
end;

The 65536 points are split into four segments (k = 0..3) of 16384
points each. Because the 65536 points are constant over all the
convolutions, I have of course calculated their FFTs in advance; those
data are in resignal and imsignal. The data in refr and imfr change
with every convolution, so that FFT has to be computed each time; refr
and imfr do not use too much memory. The size of the FFT is 32768
points.
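To cross-check the Delphi routine, here is the same overlap-add structure sketched in numpy (an illustration; the names mirror the Delphi variables):

import numpy as np

nfft = 32768
seg = 16384
fixed = np.random.rand(4 * seg)   # stands in for one measured 65536-pt pulse
test = np.random.rand(seg)        # one 16384-point test signal

# precomputed segment spectra (resignal/imsignal in the Delphi code)
seg_ffts = [np.fft.rfft(fixed[k * seg:(k + 1) * seg], nfft) for k in range(4)]
fr = np.fft.rfft(test, nfft)      # refr/imfr in the Delphi code

y = np.zeros(5 * seg)             # 65536 outputs plus the final tail
olap = np.zeros(seg)
for k in range(4):
    blk = np.fft.irfft(seg_ffts[k] * fr, nfft)
    blk[:seg] += olap             # add the saved tail of the previous segment
    olap = blk[seg:].copy()       # save this segment's tail
    y[k * seg:(k + 1) * seg] = blk[:seg]
y[4 * seg:] = olap

print(np.allclose(y[:5 * seg - 1], np.convolve(fixed, test)))   # True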

The application is to convolve two measured IR pulses (65536 points
each) with 32768 test signals (16384 points each). The result is then
further processed, e.g. by root mean square.

Hope this clarifies your questions.
Uli