FFT convolution with overlap add

Started by Uli December 13, 2004
Hi there,

I need some advice for a convolution problem:

I have to compute several thousand convolutions. I managed to do it
with the overlap-add method described in chapter 18 of the dspguide,
but it takes too much time to compute.

The method described breaks a signal with a large number of points
into small segments, which are then convolved with a small filter.

In my case I have to convolve 16384 and 65536 points. So I did it by
breaking the 65536 points into 4 segments and used a 32768-point FFT.

It seems I need to segment both inputs of the convolution. How do I do
that? Is there a formula or an example program?

Regards

Uli

Uli wrote:
> [snip]
I guess you have already tried frequency-domain FFT-based convolution,
which can be faster than time-domain linear convolution. The problem I
see here is input-output latency (delay), which adds up, and buffering
is also required a priori.

A better method has been suggested by Gardner, sometimes known as the
zero-delay algorithm. See:

W. G. Gardner, "Efficient Convolution without Input-Output Delay",
J. Audio Eng. Soc., vol. 43, no. 3, pp. 127-136, March 1995.

Hope that helps a little.

santosh
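For concreteness, here is a minimal sketch of the frequency-domain convolution described above, at the sizes from the original post (an illustration assuming numpy, not code from this thread):

import numpy as np

x = np.random.rand(65536)             # placeholder for the long input
h = np.random.rand(16384)             # placeholder for the filter

n = len(x) + len(h) - 1               # linear convolution length, 81919
nfft = 1 << (n - 1).bit_length()      # next power of two, here 131072

# multiply the zero-padded spectra, then transform back
y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)[:n]

# sanity check on small sizes: np.allclose(y, np.convolve(x, h)) should hold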
On 2004-12-14 06:34:17 +0100, "Nilnod" <santosh.nath@ntlworld.com> said:

> A better method has been suggested by Gardner, sometimes known as
> the zero-delay algorithm. See:
I think there's a patent on that by Lake DSP, IIRC (even though I
sometimes wonder why no one has patented multiplication and addition
yet, if it's possible to patent basic stuff like that). You might wish
to look it up, too.

-- Stephan M. Bernsee http://www.dspdimension.com
Uli wrote:
> [snip]
I am not sure I understand what the problem is. Is this
16384-by-65536-point convolution one of the several thousand operations
you need to do? If so, how are the datasets interconnected? What are
the time constraints on your computations? Memory? What computer
architecture do you use?

On my 1.4 GHz desktop PC with 1 GB RAM, the 65536-point FFT (no
overlap-add) takes some 0.3 seconds under Matlab 7.0. Is that very
different from what your system comes up with?

Let me speculate a bit: if you have to compute several thousand
65536-by-16384-point convolutions, it will take a lot of time on a
single-CPU computer. With the numbers cited above, 1000 65536-point
FFTs will take 30 seconds. You need two FFTs and one IFFT per
convolution operation: 90 seconds. Moving data back and forth between
disk and memory comes in addition to that, say 15 seconds per 1000
data sequences. So we are talking on the order of 1-2 minutes of CPU
time per 1000 convolutions. If you have 10000, 20000 or 100000 such
operations, you'll get very bored waiting for the thing to run through.

If my speculations above about how you work are correct, the only
thing that might speed up your program (apart from a faster CPU) is
parallel processing on a multi-CPU system. This structure is typical
for seismic processing: each sub-problem is relatively small, but the
sheer number of problems makes the computational task overwhelming.
What saves the day is that the problems are decoupled and can be
handled in parallel. If the data have to be treated in sequence...

Rune
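Rune's per-convolution cost model (two forward FFTs plus one inverse FFT) is easy to check on a given machine with a small timing loop. A sketch, assuming numpy; the numbers will of course differ from the 2004-era figures above:

import numpy as np, time

nfft = 131072                     # covers the 65536 + 16384 - 1 output points
x = np.random.rand(65536)
h = np.random.rand(16384)

reps = 20
t0 = time.time()
for _ in range(reps):
    y = np.fft.irfft(np.fft.rfft(x, nfft) * np.fft.rfft(h, nfft), nfft)
dt = (time.time() - t0) / reps
print("%.4f s per convolution, %.1f s per 1000" % (dt, dt * 1000))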
Stephan M. Bernsee wrote:
> [snip]
> I think there's a patent on that by Lake DSP, iirc (even though I
> sometimes wonder why no one has patented multiplication and addition
> yet if it's possible to patent basic stuff like that).
IIRC, in the early '70s someone attempted to patent using color on a computer display ;/
Hi Rune,

yes, I need 32768 convolutions of 16384 by 65536 points. Because I'm
not an expert, I used the algorithms from the dspguide, and a
calculation like this takes days on my 1 GHz laptop. So I studied
further and reduced the FFT size from 131072 points down to 32768 by
segmenting the 65536 points into four portions and using FFT with
overlap-add. Now the computation with the smaller FFT size takes
10 hours.

So my idea was to reduce the FFT size further. But the bigger input
has the main influence, so I think that splitting e.g. the 16384
points into 8 x 2048 and the 65536 points into 32 x 2048 and using a
4096-point FFT should reduce the overall calculation time. Because it
is a static calculation, latency or delay is not the problem, only the
total time.
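Segmenting both inputs amounts to a simple identity: split x into M blocks and h into K blocks of a common length B, and each pairwise block convolution lands at offset (m + k)*B in the output. A toy-sized numpy sketch of that formula (an illustration, not Uli's code):

import numpy as np

B = 64                                # block length (2048 in the post above)
x = np.random.rand(8 * B)             # stands in for the 16384-point input
h = np.random.rand(32 * B)            # stands in for the 65536-point input

y = np.zeros(len(x) + len(h) - 1)
for m in range(len(x) // B):
    for k in range(len(h) // B):
        xb = x[m * B:(m + 1) * B]
        hb = h[k * B:(k + 1) * B]
        # each block pair contributes a (2B - 1)-point partial result
        y[(m + k) * B:(m + k) * B + 2 * B - 1] += np.convolve(xb, hb)

print(np.allclose(y, np.convolve(x, h)))   # True

Note, though, that the number of block pairs grows as M x K, so a smaller FFT does not by itself mean less total work.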

A parallel computer is not possible; I have only a single-CPU PC. I
also do not know how to code it in the right way.

Uli

Uli wrote:
> [snip]
What FFT library do you use? As Rune said, this could take long, but
not days unless you have a very inefficient FFT. For fun I tried to
hack this problem up in Numerical Python (script below). I have a
750 MHz Duron, so almost anything should be able to do this faster. On
my machine each convolution takes a little more than half a second, so
the estimated time for 32768 convolutions would be 4 hrs. You are not
trying to hold it all in RAM at once, are you? In that case you will
swap a lot, and that will slow the machine to a halt.

-- Brian

Python script:

from Numeric import *
from FFT import *
from random import random
from time import time

la = 65536
lb = 16384
lt = la + lb

a = array([0.] * lt)
b = array([0.] * lt)

# Do this 32768 times
for k in range(32768):
    # Form two random sets to convolve
    s1 = array([random() for x in range(la)])
    s2 = array([random() for x in range(lb)])
    a[:la] = s1
    b[:lb] = s2
    t = time()
    r = inverse_fft(fft(a) * fft(b))
    print "Time: %f" % (time() - t)
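One easy win in the script above: the 65536-point operand is the same in every pass, so its transform can be hoisted out of the loop (Rune makes the same point below). A rough equivalent in modern numpy, as an illustration:

import numpy as np

la, lb = 65536, 16384
nfft = la + lb                    # >= la + lb - 1, same padding as the script
fixed = np.random.rand(la)        # placeholder for the fixed 65536-point data
F = np.fft.rfft(fixed, nfft)      # computed once, outside the loop

for _ in range(4):                # 32768 in the real run
    sig = np.random.rand(lb)      # placeholder for one 16384-point input
    y = np.fft.irfft(F * np.fft.rfft(sig, nfft), nfft)

This costs one forward and one inverse FFT per convolution instead of two forward plus one inverse.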
Brian Dam Pedersen wrote:

> [snip]
Uli, I agree that the FFT you are using sounds very inefficient. You
could check the method of accessing the FFT 'twiddle factors'. The
twiddle factors (phase rotations) should be calculated once and stored
in a table. The FFT butterflies should access the table, and not waste
time calling cos and sin functions each time a twiddle factor is
needed.

Regards,

John
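To make John's point concrete, here is a sketch of an iterative radix-2 FFT in which every twiddle factor comes from a table built once up front, so no sin/cos is evaluated inside the butterflies (an illustration in Python, not the dspguide routine):

import numpy as np

def fft_with_table(x):
    # iterative radix-2 Cooley-Tukey FFT; twiddle factors are read from
    # a table computed once, not via sin/cos calls in the butterflies
    x = np.array(x, dtype=complex)
    n = len(x)                                       # must be a power of two
    W = np.exp(-2j * np.pi * np.arange(n // 2) / n)  # twiddle table

    # bit-reversal permutation
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]

    # butterfly passes
    size = 2
    while size <= n:
        step = n // size                 # stride into the twiddle table
        half = size // 2
        for start in range(0, n, size):
            for k in range(half):
                w = W[k * step]          # table lookup, no sin/cos call here
                a = x[start + k]
                b = x[start + k + half] * w
                x[start + k] = a + b
                x[start + k + half] = a - b
        size *= 2
    return x

# quick check: v = np.random.rand(1024); np.allclose(fft_with_table(v), np.fft.fft(v))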
Uli wrote:
> [snip]
You haven't really explained what problem you are trying to solve. Now
that you tell us that you experiment with different FFT lengths, I
don't see off the top of my head what kind of problem you are working
with. Could you provide some details about what type of signal you
work with, and what you try to achieve?

If you try to filter one very long data sequence with a fixed filter,
you compute the DFT of the filter once and store it. That way, you
save some 30% run-time.

Anyway, from your stated numbers, I get that you have 32768 * 65536 =
2.1 billion samples. Assume these are stored in single-precision
floating-point format, and you have over 8 GB of data to handle. That
doesn't fit into the memory of your computer, so with the naive
approach you just get stuck swapping data. From what you say about
run-time, it sounds like this is what happens to you. So regardless of
what you end up doing as far as the filtering is concerned, you HAVE
to think seriously about data flow issues.

What you need to do is find out how much RAM your computer has. I
don't know how much RAM the OS needs to run, but assume you have
512 MB RAM; then assign not much more than 256 MB for the data:
128 MB as an input buffer and 128 MB as an output buffer, some space
for internal variables, and still enough room for the OS to do its job
without starting to swap. These are conservative estimates, so you
might be able to squeeze the numbers a bit. Keep in mind that the
objective of the whole exercise is to minimize the number of accesses
to the disk, both on input and on output, while staying inside the
available RAM.

You don't say what computational environment you use. My estimates are
based on using a reasonably efficient C/C++ or FORTRAN implementation.
All bets as far as run-time is concerned are off if you use Matlab.
Matlab is notoriously inefficient for heavy-duty stuff like what you
have here. To demonstrate, I have this test:

% ================= Script looptest.m ============================
N=10000000;   % Reduce if memory runs out
V=0:N-1;

sum1=0;
sw1=cputime;
for n=1:N
    sum1=sum1+V(n);
end
cputime1=cputime-sw1;

sw2=cputime;
sum2=sum(V);
cputime2=cputime-sw2;

disp(['Run-time, computing the sum of ',int2str(N),' elements:']);
disp(['For-loop          : ',num2str(cputime1),' seconds']);
disp(['Built-in function : ',num2str(cputime2),' seconds']);
% ==== End looptest.m ============================================

The output on my computer is

Run-time, computing the sum of 10000000 elements:
For-loop          : 36.437 seconds
Built-in function : 0.203 seconds

So there is an absurd overhead associated with loops in Matlab.

Rune
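A sketch of the buffered data flow Rune describes, using numpy; the file names, shapes, and batch size below are made up for illustration:

import numpy as np

la, lb = 65536, 16384
nfft = 131072
batch = 256                       # tune so batch * nfft * 8 bytes fits in RAM

# the fixed signal's DFT is computed once (hypothetical input file)
fixed = np.fromfile("fixed_signal.f64", dtype=np.float64)
F = np.fft.rfft(fixed, nfft)

# stream the 32768 inputs through RAM in batches instead of loading all 8 GB
inputs = np.memmap("inputs.f64", dtype=np.float64, mode="r",
                   shape=(32768, lb))
out = np.memmap("results.f64", dtype=np.float64, mode="w+",
                shape=(32768, la + lb - 1))

for i in range(0, 32768, batch):
    blk = np.asarray(inputs[i:i + batch])      # read one batch into RAM
    Y = np.fft.rfft(blk, nfft, axis=1) * F     # broadcast the fixed DFT
    out[i:i + batch] = np.fft.irfft(Y, nfft, axis=1)[:, :la + lb - 1]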
Hi all,

Thanks first for all your comments. As I honestly said, I'm a newbie
to DSP; the book at www.dspguide.com is what I have understood so far,
and I have used the code samples there directly. They are given in
BASIC and I have translated them to Delphi, my favourite programming
language. So I have declared arrays of the described sizes with double
precision.


Some part of the convolution program:

procedure convolution(nr: integer);
var
  i, k: integer;
  a, b, c, d: double;
begin
  { clear the overlap buffer }
  for i := 0 to high(olap) do
    olap[i] := 0;

  { process the four 16384-point segments of the 65536-point signal }
  for k := 0 to 3 do
  begin
    { complex multiply in the frequency domain: precomputed segment
      spectrum times test-signal spectrum, bins 0..16384 of the
      32768-point real FFT }
    for i := 0 to 16384 do
    begin
      a := resignal[nr][k][i];
      b := imsignal[nr][k][i];
      c := refr[i];
      d := imfr[i];
      fft_rex[i] := a*c - b*d;
      fft_imx[i] := a*d + b*c;
    end;
    { bins above 16384 are not used; clear them }
    for i := 16385 to high(fft_rex) do
    begin
      fft_rex[i] := 0;
      fft_imx[i] := 0;
    end;

    fft_inversetransform;

    { overlap-add: fold in the saved tail of the previous segment,
      save this segment's tail, emit 16384 finished output points }
    for i := 0 to high(olap) do
      fft_rex[i] := fft_rex[i] + olap[i];
    for i := 16384 to high(fft_rex) do
      olap[i-16384] := fft_rex[i];
    for i := 0 to 16383 do
      y[i + k*16384] := fft_rex[i];
  end;

  { flush the final overlap tail }
  for i := 0 to high(olap) do
    y[i + 65536] := olap[i];
end;

The 65536 points are split into four segments (k = 0..3) of 16384
points each. Because the 65536 points are constant over all the
convolutions, I have of course calculated their FFTs in advance; those
data are in resignal and imsignal. The data in refr and imfr change
with every convolution, so that FFT has to be computed each time; refr
and imfr do not use too much memory. The size of the FFT is 32768
points.
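To cross-check the Delphi routine, here is the same overlap-add structure sketched in numpy (an illustration; the names mirror the Delphi variables):

import numpy as np

nfft = 32768
seg = 16384
fixed = np.random.rand(4 * seg)   # stands in for one measured 65536-pt pulse
test = np.random.rand(seg)        # one 16384-point test signal

# precomputed segment spectra (resignal/imsignal in the Delphi code)
seg_ffts = [np.fft.rfft(fixed[k * seg:(k + 1) * seg], nfft) for k in range(4)]
fr = np.fft.rfft(test, nfft)      # refr/imfr in the Delphi code

y = np.zeros(5 * seg)             # 65536 outputs plus the final tail
olap = np.zeros(seg)
for k in range(4):
    blk = np.fft.irfft(seg_ffts[k] * fr, nfft)
    blk[:seg] += olap             # add the saved tail of the previous segment
    olap = blk[seg:].copy()       # save this segment's tail
    y[k * seg:(k + 1) * seg] = blk[:seg]
y[4 * seg:] = olap

print(np.allclose(y[:5 * seg - 1], np.convolve(fixed, test)))   # True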

The application is to convolve two measured IR pulses (65536 points
each) with 32768 test signals (16384 points each). The result is then
further processed, e.g. by root mean square.

Hope this clarifies your questions.
Uli