DSPRelated.com
Forums

Convolution versus LPFiltering

Started by XED February 22, 2004
Andrew Reilly wrote:

> Hi Jerry,
>
> I had noticed this thread, but had not paid any attention until now. And
> now I can't see the earlier messages. Well, I could google, but you know...
...
> By "transversal filter", you mean "Direct-form FIR filter", don't you? So
> what do you mean by "start-up transient"? Isn't that a characteristic of
> the response/coefficients, rather than the implementation?
Some recursive filters have finite impulse responses. I wanted to be explicit. I could just as well have written "non-recursive FIR". The start-up transient is the filter's output before it has filled with samples. When you think about it, you can see that a filter's output can only be accurate when all the samples in it are valid ones, not zeros or whatever its initial contents might have been when the signal began. There is another transient when the signal stops, but that's evident only when fabricated samples are fed into the filter to "squeeze the last toothpaste out of the tube."
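Jerry's point about the fill-up period is easy to see numerically. Here is a short Python/NumPy sketch (my own illustration, not code from the thread), assuming a zero-filled delay line when the signal begins:

```python
import numpy as np

h = np.array([0.5, 0.3, 0.2])    # a hypothetical 3-tap FIR (taps sum to 1)
x = np.ones(8)                   # DC input, easy to eyeball

# Causal FIR output with zero initial state: the first len(h)-1 outputs
# mix real input samples with the filter's initial zeros.
y = np.convolve(x, h)[:len(x)]

# y[0] and y[1] are the start-up transient; from y[2] on, every tap
# holds a valid sample and the output settles at 1.0.
print(y)
```

Only once all three taps hold real samples is the output accurate, which is exactly the start-up transient described above.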
> Since a transversal, or direct-form FIR filter has no inherent limitations
> other than causality, and Mark was not suggesting that you could
> overlap-save with less delay than the blocking factor, I'm not sure what
> the source of your skepticism is? Could you elaborate?
It seemed to me that he might have been saying that. I will wait for enlightenment. It's quite likely that I don't yet understand.
> Mark: how does changing where the "save" part of the result appears in the
> ifft buffer "avoid a buffer copy"? Presumably you have to at least copy
> the data into an I/O unit or buffer at some stage, anyway.
Time will tell. I didn't try to analyze the code.
> Cheers,
Jerry
--
Engineering is the art of making what you want from things you can get.
Jerry Avins wrote:
> Mark Borgerding wrote:
>
>>>>
>>>> FWIW,
>>>> You can get rid of the startup transient (or rather shift it to the
>>>> buffer end) by rotating the zero padded filter impulse response to
>>>> the left by nh-1 samples. This moves the "scrap" samples from the
>>>> front of the output buffer to the back.
[snip]
>> Any fast convolution filtering method introduces a delay due to the
>> buffering of input samples. ( Except for the ludicrous case with one
>> fft->ifft operation for each sample. )
>>
>> The overlap-scrap-tail method does not introduce any delay beyond that
>> already imposed by fast-convolution filtering. Does that answer your
>> question?
[snip]
>> Here's tailscrap.m
>> ============================================
>> % test code for circular convolution with the scrapped portion
>> % at the tail of the buffer, rather than the front
>> %
>> % The idea is to rotate the zero-padded h (impulse response) buffer
>> % to the left nh-1 samples, rotating the junk samples as well.
>> % This could be very handy in avoiding buffer copies during fast
>> % filtering.
>> nh=10;
>> nfft=256;
>>
>> h=rand(1,nh);
>> x=rand(1,nfft);
>>
>> hpad=[ h(nh) zeros(1,nfft-nh) h(1:nh-1) ];
>>
>> % baseline comparison
>> y1 = filter(h,1,x);
>> y1_notrans = y1(nh:nfft);
>>
>> % fast convolution
>> y2 = ifft( fft(hpad) .* fft(x) );
>> y2_notrans = y2(1:nfft-nh+1);
>>
>> maxabsdiff = max(abs(y2_notrans - y1_notrans))
>
> It looks interesting. The fourth conservation law, Conservation of
> Inherent Limitation, makes me skeptical that a shorter start-up
> transient than that imposed by a transversal filter is possible. I
> hope the limitation isn't really inherent. I eagerly await the PDF.
>
> Jerry
Aww. To heck with the pdf. I think I can explain it using an example and ascii-art. (I'll try to choose my words carefully, but I will probably say a thing or two worth correcting.)

The transient response is still there. It's the last nh-1 samples of the output buffer. The invalid output samples get shifted there by a linear phase multiplication in the frequency domain (i.e. convolution with a delta pulse in the time domain).

Here's an example with nh=3 and nfft=8, both with classical overlap-scrap (aka overlap-save) processing and overlap-scrap-tail (for lack of a better name) processing. I won't elaborate on the general theory of overlap-save/scrap. Any DSP book will do a better job explaining than I would. I'm just trying to explain the differences.

*********** Classical overlap-scrap *****

Circular convolution of impulse response [ h0 h1 h2 ] with signal

  xvec = [ x0 x1 x2 x3 x4 x5 x6 x7 ]

The impulse response is padded to the same length as xvec:

  hpad = [ h0 h1 h2 0 0 0 0 0 ]

The circular convolution of the two sequences is

  cc = ifft( fft(hpad) .* fft(xvec) )

  cc0 == h0x0 + h1x7 + h2x6

or, viewed as sliding windows,

      ...  0  0 h2 h1 h0
  ( ... x4 x5 x6 x7) x0 x1 x2 x3 x4 x5 x6 x7

cc2 is the first usable sample (i.e. no edge transients):

            ...  0  0 h2 h1 h0
  ( ... x6 x7) x0 x1 x2 x3 x4 x5 x6 x7

cc7 is the last usable sample:

               ...  0  0 h2 h1 h0
   x0 x1 x2 x3 x4 x5 x6 x7

So cc2 thru cc7 are valid, filtered output samples.

********** overlap-scrap-tail method ****

Pad the impulse response as before. Rotate the padded impulse response to the left by nh-1 (i.e. 2) samples. The rotated, padded impulse response is

  hpadrot = [ h2 0 0 0 0 0 h0 h1 ]

As before, circularly convolve using the fft.
  ccr = ifft( fft(hpadrot) .* fft(xvec) )

ccr0 is the first usable sample:

  ccr0 == h2x0 + h0x2 + h1x1

      h1 h0  0  0  0  0  0 h2
   x0 x1 x2 x3 x4 x5 x6 x7|x0 x1 x2 x3 x4 x5 x6 x7

The last usable sample is

  ccr5 == h2x5 + h0x7 + h1x6

                     h1 h0  0  0  0  0  0 h2
   x0 x1 x2 x3 x4 x5 x6 x7|x0 x1 x2 x3 x4 x5 x6 x7

ccr0 thru ccr5 are the valid output samples. They correspond to cc2 thru cc7 in the classical method. The last nh-1 samples from the OST method are the "scrap" samples, corresponding to the first nh-1 samples from the classical OS method. Or in general,

  ccr(n) = cc(n+nh-1), with wrap-around at the edges.

Classical overlap-scrap filtering discards the samples produced by the circular nature of fft convolution. In contrast, the overlap-scrap-tail method exploits the circular nature to shift the startup transient to the end. This allows the ifft stage to place its output directly into an output buffer*. The transient tail samples can then be overwritten by the next ifft.

*if there is sufficient space for the transient response at the tail

Other than this buffer copy, the computational requirements for either method are identical.

--
Mark Borgerding
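Mark's nh=3, nfft=8 example can be checked directly. A Python/NumPy translation (my sketch, with variable names following his post): rotating the zero-padded impulse response left by nh-1 samples rotates the circular-convolution result by the same amount, moving the scrap samples from the front to the tail.

```python
import numpy as np

nh, nfft = 3, 8
h = np.random.rand(nh)
x = np.random.rand(nfft)

hpad    = np.concatenate([h, np.zeros(nfft - nh)])  # [h0 h1 h2 0 0 0 0 0]
hpadrot = np.roll(hpad, -(nh - 1))                  # [h2 0 0 0 0 0 h0 h1]

cc  = np.fft.ifft(np.fft.fft(hpad)    * np.fft.fft(x)).real  # classical
ccr = np.fft.ifft(np.fft.fft(hpadrot) * np.fft.fft(x)).real  # tail-scrap

# ccr(n) == cc(n+nh-1) with wrap-around: valid samples move to the front
# of the buffer, scrap samples to the tail.
print(np.allclose(ccr, np.roll(cc, -(nh - 1))))  # True
```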
Andrew Reilly wrote:
>
> Mark: how does changing where the "save" part of the result appears in the
> ifft buffer "avoid a buffer copy"? Presumably you have to at least copy
> the data into an I/O unit or buffer at some stage, anyway.
It allows you to overwrite one block's transient response with the valid samples from the next block.

One can achieve speed gains whenever it is efficient to write more than nfft-nh+1 output samples at a time. Buffering data reduces system calls. This buffering is a freebie if you can ifft directly to your output buffer.

Memory-mapped file access should also benefit from this.

DMA may be another example (not 100% sure, tho).

In my tests under linux it gave a modest gain. About 3-5%.

The ifft results are written to the output buffer with nh-1 overlap:

[ valid block0 + trans ]
                [ valid block 1 + trans ]
                                [ valid block 2 ...

In the classical overlap-scrap implementation, the invalid samples are at the front of the ifft result. The valid samples would have to be copied to the output buffer.

--
Mark Borgerding
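The whole streaming loop can be sketched in a few lines of Python/NumPy (my own sketch; the function name and buffer layout are mine, not Mark's). Each block's nfft-point ifft is written whole into the output buffer, and its nh-1 scrap tail is simply overwritten by the next block's valid samples:

```python
import numpy as np

def ost_filter(h, x, nfft=16):
    """Fast convolution with the scrap samples at the buffer tail."""
    nh = len(h)
    step = nfft - nh + 1                         # valid samples per block
    hpadrot = np.roll(np.concatenate([h, np.zeros(nfft - nh)]), -(nh - 1))
    H = np.fft.fft(hpadrot)

    nvalid = len(x) - nh + 1                     # steady-state output length
    nblocks = -(-nvalid // step)                 # ceiling division
    out = np.zeros(nblocks * step + nh - 1)
    for k in range(nblocks):
        seg = x[k * step : k * step + nfft]
        seg = np.pad(seg, (0, nfft - len(seg)))  # last block may run short
        blk = np.fft.ifft(H * np.fft.fft(seg)).real   # valid first, scrap last
        out[k * step : k * step + nfft] = blk    # scrap tail overwritten next pass
    return out[:nvalid]

h = np.random.rand(4)
x = np.random.rand(100)
y = ost_filter(h, x)
print(np.allclose(y, np.convolve(x, h, mode="valid")))  # True
```

Note each write covers nfft samples but advances only nfft-nh+1 samples, exactly the nh-1 overlap shown in the diagram above.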
In article glU_b.7335$OE4.4169@fe1.columbus.rr.com, Mark Borgerding at
mark@borgerding.net wrote on 02/24/2004 22:09:

> Andrew Reilly wrote:
>>
>> Mark: how does changing where the "save" part of the result appears in the
>> ifft buffer "avoid a buffer copy"? Presumably you have to at least copy
>> the data into an I/O unit or buffer at some stage, anyway.
>
> It allows you to overwrite one block's transient response with the valid
> samples from the next block.
>
> One can achieve speed gains whenever it is efficient to write more than
> nfft-nh+1 output samples at a time. Buffering data reduces system
> calls. This buffering is a freebie if you can ifft directly to your
> output buffer.
>
> Memory-mapped file access should also benefit from this.
>
> DMA may be another example (not 100% sure, tho).
>
> In my tests under linux it gave a modest gain. About 3-5%.
>
> The ifft results are written to the output buffer with nh-1 overlap.
>
> [ valid block0 + trans ]
>                 [ valid block 1 + trans ]
>                                 [ valid block 2 ...
we call that "overlap-add". before this thread, i have never heard of "overlap-scrap" and thought until now it meant the "overlap-save" method.
> In the classical overlap-scrap implementation, the invalid samples are
> at the front of the ifft result. The valid samples would have to be
> copied to the output buffer.
maybe it *is* the overlap-save method by another name. is that "trans" getting added to the next "valid block" before output? r b-j
robert bristow-johnson wrote:
> In article glU_b.7335$OE4.4169@fe1.columbus.rr.com, Mark Borgerding at
> mark@borgerding.net wrote on 02/24/2004 22:09:
>
>> Andrew Reilly wrote:
>>
>>> Mark: how does changing where the "save" part of the result appears in the
>>> ifft buffer "avoid a buffer copy"? Presumably you have to at least copy
>>> the data into an I/O unit or buffer at some stage, anyway.
>>
>> It allows you to overwrite one block's transient response with the valid
>> samples from the next block.
>>
>> One can achieve speed gains whenever it is efficient to write more than
>> nfft-nh+1 output samples at a time. Buffering data reduces system
>> calls. This buffering is a freebie if you can ifft directly to your
>> output buffer.
>>
>> Memory-mapped file access should also benefit from this.
>>
>> DMA may be another example (not 100% sure, tho).
>>
>> In my tests under linux it gave a modest gain. About 3-5%.
>>
>> The ifft results are written to the output buffer with nh-1 overlap.
>>
>> [ valid block0 + trans ]
>>                 [ valid block 1 + trans ]
>>                                 [ valid block 2 ...
>
> we call that "overlap-add". before this thread, i have never heard of
> "overlap-scrap" and thought until now it meant the "overlap-save" method.
This is not the overlap-add method. It does not zero pad the signal. It does not add transients to the previous ifft buffer. O-A is generally slightly less efficient than O-S because it has the extra addition step in each block.

Overlap-scrap and overlap-save are two names given to the same technique. I prefer the former b/c it is more descriptive of what happens with the transient response -- cf. the "add" in overlap-add. In overlap-add, the transient response is added (to the previous output block). In overlap-scrap, the transient response is scrapped (the "saving" is done at the other end of the pipeline). The name by which it is known is not important.

FWIW, I think I first saw the name Overlap-Scrap in the Frerking comms book.
>> In the classical overlap-scrap implementation, the invalid samples are
>> at the front of the ifft result. The valid samples would have to be
>> copied to the output buffer.
>
> maybe it *is* the overlap-save method by another name. is that "trans"
> getting added to the next "valid block" before output?
Nope, it gets scrapped.

Hmm, maybe my preference for the name "Overlap-Scrap" has more sophomoric roots.

"That part's scrap."
"What's crap?"
"This is scrap."
"Don't call it crap."
"I didn't. I said, 'it's scrap.'"
"Who's on first again?"

;^)

--
Mark Borgerding
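For contrast, here is a minimal overlap-add sketch in Python/NumPy (my own, not code from the thread): each input block is zero padded, and the nh-1-sample tail of each ifft is *added* into the region shared with the next block -- the extra step that overlap-save/scrap avoids.

```python
import numpy as np

def overlap_add(h, x, nfft=16):
    nh = len(h)
    step = nfft - nh + 1                 # input samples consumed per block
    H = np.fft.fft(h, nfft)              # zero pads h to nfft
    y = np.zeros(len(x) + nh - 1)
    for k in range(0, len(x), step):
        seg = x[k : k + step]            # short input block
        blk = np.fft.ifft(H * np.fft.fft(seg, nfft)).real  # linear conv of block
        y[k : k + nfft] += blk[: len(y) - k]  # overlapping tails are summed
    return y

h = np.random.rand(4)
x = np.random.rand(50)
print(np.allclose(overlap_add(h, x), np.convolve(x, h)))  # True
```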
"Mark Borgerding" <mark@borgerding.net> wrote in message
news:403C04EB.40705@borgerding.net...

Well, I didn't try to follow all the explanations because what's being
discussed seems so clear.
Maybe this will shed some light:

First, you're talking about fast convolution using FFT/multiply/IFFT, right?

When this is done, it's possible to register the output in time so that
there is no "time delay" by making the filter "non-causal".
In fast (circular) convolution, this is done by rotating the filter 1/2 its
length.  This puts half the filter temporal response at the end of the
periodic epoch or "in the future".  But it doesn't get rid of half of the
transient response.

In order to get rid of the up-front transient response you have to rotate
the filter its full length (less one sample) so that the entire filter is
"in the future".
Of course, you really don't get rid of the transient response, it just shows
up at the end of the output sequence instead of the beginning.  This can be
considered to be a transient precursor.

In classical system theory we don't generally allow for non-causal filters.
That only happens in post-acquisition processing.  In real-time processing
we're usually content to use the former and accept the delay that's inherent
in the process.

So, I'd have to guess that this overlap-scrap method doesn't help with the
classical delay at all. However, it may well help with what we might call
"latency" - delay of another type - because it appears some improvement in
processing time is offered.

Why doesn't it help with classical delay?  Because one has to fill the
temporal epoch with data before the processing can take place and because
there is still data that is scrapped or overlapped or whatever.

I can imagine two time lines:

1) Overlap method with filter length L:
Gather data for the entire temporal epoch NT
FFT/multiply/IFFT
Throw out the first L-1 samples - the transient response.
Result is N-L+1 samples, with the first valid sample at time NT-N+L-1 = N(T-1)+L-1.
The first valid sample is delayed by NT plus the processing time P1.

2) With filter length L rotated L-1 samples
Gather data for the entire temporal epoch NT
FFT/multiply/IFFT
Result is N-L+1 samples with the first valid sample at time INDEX zero.
The first valid sample is delayed by NT plus the processing time P2.

It is interesting to note that the next result can be plopped into the
output buffer without worrying about the transient part that shows up at the
end of each epoch - because one can simply overwrite it.  Cool.
On the other hand, in (1) one can also ignore the transient region of
samples.

So, comparing (2) to (1), it appears that there could be a reduced Total
Delay if the processing is actually less.  However, the initial delay is the
same after this is taken into account.  And, I really don't see how the
processing is less because generally one is using a circular output buffer,
no?  If that's the case it's only a matter of indexing to figure out where
to begin reading out and where the transient part originally sat.

Did I get it right?

Fred




Fred Marshall wrote:

   ...

> Did I get it right?
>
> Fred
One of the things I learned here is not to shoot my mouth off too often without careful calculation, and I've exceeded my quota recently. That's why I started this inquiry asking about real-time performance, but made no claims one way or t'other. Mark's discussion about process never addressed the startup transient as far as I could see, so I held my peace. You make the very same argument that I had in mind, but better. So I can't say you got it right, but at least we agree. FWIW to you.

Jerry
--
Engineering is the art of making what you want from things you can get.
Fred Marshall wrote:
> "Mark Borgerding" <mark@borgerding.net> wrote in message
> news:403C04EB.40705@borgerding.net...
[snip]
> Of course, you really don't get rid of the transient response, it just shows
> up at the end of the output sequence instead of the beginning. This can be
> considered to be a transient precursor.
Bingo!
> So, I'd have to guess that this overlap-scrap method doesn't help with the
> classical delay at all. However, it may well help with what we might call
> "latency" - delay of another type - because it appears some improvement in
> processing time is offered.
It doesn't get rid of any delay at all. Blocks come in. Blocks go out. It doesn't necessarily even improve processing time, because it doesn't eliminate any arithmetic operations. The only possible benefit is that it can allow an implementor to skip a buffer copy. I never claimed anything more.

I'm not making any non-causal time machines here. I'm just rotating the output blocks by circular logic and convoluted thinking :)
> It is interesting to note that the next result can be plopped into the > output buffer without worrying about the transient part that shows up at the > end of each epoch - because one can simply overwrite it.
Bingo again!

> I really don't see how the processing is less because generally one is
> using a circular output buffer

But if the output from the ifft begins with invalid samples, where can you point the ifft output so that the first valid sample from block N will be placed right after the last valid sample from block N-1? I don't see how a circular output buffer obviates the buffer copy with textbook overlap-save. You might be on to something, but I just can't see it.

> Did I get it right?
>
> Fred

Yes, and described quite well.

--
Mark
Jerry Avins wrote:

> Fred Marshall wrote:
>
>    ...
>
>> Did I get it right?
>>
>> Fred
>
> Mark's discussion about process never
> addressed the startup transient as far as I could see,
I would've thought this did (copied from my earlier post):

"The transient response is still there. It's the last nh-1 samples of the output buffer. The invalid output samples get shifted there by a linear phase multiplication in the frequency domain (i.e. convolution with a delta pulse in the time domain)."

Was this unclear?

Thanks for your interest,
Mark
Mark Borgerding wrote:

> Jerry Avins wrote:
>
>> Fred Marshall wrote:
>>
>>    ...
>>
>>> Did I get it right?
>>>
>>> Fred
>>
>> Mark's discussion about process never
>> addressed the startup transient as far as I could see,
>
> I would've thought this did (copied from my earlier post):
>
> "The transient response is still there. It's the last nh-1 samples of
> the output buffer. The invalid output samples get shifted there by a
> linear phase multiplication in the frequency domain (i.e. convolution
> with a delta pulse in the time domain)."
Mark,

But then it isn't a start-up transient. The ordinary operation of a usual FIR has output less than perfect until as many samples have been accepted as the filter has taps. It is possible to suppress the output until that number of samples have been loaded, discarding the imperfect ones, but that delays when output begins. You seem to claim either that perfect output begins immediately with "overlap scrap", before a transversal calculation would make it available, or that the initial filter inputs actually produce good output.
> Was this unclear?
I hope so, for my sanity's sake.

Jerry
--
Engineering is the art of making what you want from things you can get.