
Time Delay of a FIR filter

Started by adnanyusuf May 14, 2015
On 5/15/15 6:07 AM, Marcel Mueller wrote:
> On 15.05.15 01.50, robert bristow-johnson wrote:
>> all of the FFT blocks *must* have delay (or "latency", whatever word you
>> wanna use) because double-buffering is necessary. the minimum delay
>> must be twice the length of the FIR block.
>
by "block", i meant a block or buffer of samples of signal, *not* a "segment" or "partition" of the FIR that we're breaking up regarding the Gardner paper.
> AFAIK that's not true. You could choose an FFT size that is less than
> twice the filter kernel length.
sure, but it cannot be less than the filter kernel length plus the buffer
or block length of signal samples, minus 1.
> You simply get fewer output samples
> with each transformation in this case.
yes. but the minimum delay of that segment of the FIR must be at least
twice the number of output samples in a block. you:

 1. input B samples
 2. then ping-pong your buffers
 3. then FFT the B samples, either zero-padded for overlap-add or padded
    with the N-B samples of the previous buffer for overlap-save
 4. multiply in the frequency domain
 5. iFFT back
 6. ping-pong back to output that block of samples, or actually add those
    samples to the outputs from the other FIR segments

now the ping-ponging in step 2 and in step 6 are actually the same
ping-pong, but in step 6 it's for the following buffer.

during this processing period, you must guarantee the correctness of the
B input samples in your buffer at the beginning of your block period
(immediately after the ping-ponging of step 2) and must not insist on the
correctness of the B output samples until the very end of that block
period (immediately before step 6). that's what double-buffering affords
you. but the cost is a minimum delay of two blocks, or 2B samples.
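a minimal sketch of steps 1 through 6 for one FIR segment, assuming
Python/numpy and the overlap-save variant; the block length B, FFT size
N, placeholder filter, and buffer names are all illustrative, not from
any particular implementation:

```python
import numpy as np

B = 256                        # block length (illustrative)
h = np.zeros(257); h[0] = 1.0  # one FIR segment, length L = 257 (placeholder)
N = 512                        # FFT size, needs N >= L + B - 1
H = np.fft.rfft(h, N)          # frequency response, computed once

tail = np.zeros(N - B)         # previous N-B input samples (overlap-save pad)
pending = np.zeros(B)          # previous block's result (the double buffer)

def process_block(x):
    """Steps 1-6 for one block of B input samples.

    Returns the *previous* block's output, so the caller gets its answer
    one full block period after those inputs were complete -- i.e. 2B
    samples of delay relative to the first sample of that input block."""
    global tail, pending
    y_prev = pending                           # step 6 (for the prior block)
    frame = np.concatenate([tail, x])          # step 3: pad with previous N-B samples
    y = np.fft.irfft(np.fft.rfft(frame) * H)   # steps 4 and 5
    pending = y[N - B:]                        # last B samples are alias-free
    tail = frame[B:]                           # slide the overlap-save history
    return y_prev
```

the ping-pong of steps 2 and 6 shows up here as the swap through the
`pending` buffer.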
> Of course, it makes no sense to
> have only a few samples more. But 167% of the FIR length works quite
> well, e.g. FIR kernel of 6144 and FFT of 8192.
very inefficient. you do *not* want the FIR kernel to be such a large
portion of the FFT.

these are the facts i am dealing with:

the FIR is partitioned into M segments:

          M-1
   h[n] = SUM h_m[n-D_m]
          m=0

where D_m is the delay of the earliest sample of the m-th segment.

   h_m = 0   for n < D_m and n >= D_(m+1)

that makes the (non-zero) length of the m-th segment

   L_m = D_(m+1) - D_m

or

   D_(m+1) = D_m + L_m                    (Eq 1)

D_0 is the minimum latency of this partitioned FIR filter. any samples
of h[n] where n < D_0 must be done with a regular FIR (and added in) and
are not modeled here.

then you have the requirement of avoiding time-aliasing in the circular
convolution that you do with the FFT:

   N_m >= L_m + B_m - 1                   (Eq 2)

where N_m is the FFT size of the m-th segment and B_m is the number of
samples processed in a block of samples for the m-th segment.

but i maintain that you must double-buffer those B_m samples so that
they are *all* input before any FFT is done on them, and the results of
the circular convolution are not needed until the *following* block (so
you have an entire block time to do the job). the consequence of that
double buffering is

   D_m >= 2*B_m                           (Eq 3)

if you put Eqs 1, 2, and 3 together (and assume equality in Eqs 2 and 3
for the moment), you get

   D_(m+1) = (1/2)*D_m + N_m + 1

if you *assume* (you don't have to do this, in fact i recommend against
it) that you will double the size of your FFT with each segment

   N_(m+1) = 2 * N_m

then you get something like

   L_m = (2/3)*(N_m + 1)
   B_m = (1/3)*(N_m + 1)
   D_m = (2/3)*(N_m + 1) = L_m

and

   D_(m+1) = 2 * D_m .

but you don't have to do that. you could choose to double the FFT size
after the second block. that is

   N_(m+2) = 2 * N_m

then, assuming even m,

   N_(m+1) = N_m     (same power of two)

you find that

   L_m     = (4/7)*(N_m + 1)
   B_m     = (3/7)*(N_m + 1)
   D_m     = (6/7)*(N_m + 1)

   L_(m+1) = (2/7)*(N_m + 1)
   B_(m+1) = (5/7)*(N_m + 1)
   D_(m+1) = (10/7)*(N_m + 1)

and you will see that

   D_(m+2) = (12/7)*(N_m + 1) = 2 * D_m

   D_(m+2) = D_(m+1) + L_(m+1) = D_m + L_m + L_(m+1)

now the only advantage you get from this is that you have a more
efficient FFT, especially for the (m+1)-th segment, because the length
of the FIR kernel is much smaller: 2/7 of the FFT size. but it's better
for both segments.

those are the quantitative facts i am dealing with, Marcel. having
L_m = (3/4)*N_m is just not efficient.

i can send you (or anyone else) two short little pdf files of the above
analysis and of why we want the FIR kernel length to be about 1/5 or 1/6
of the FFT size (*not* 3/4 of the FFT). but you need to send me a
de-spammed return address.

L8r,

--
r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
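as a sketch of the second scheme (doubling the FFT size every *two*
segments), assuming Python, here is the recurrence from Eqs 1, 2, and 3
iterated directly; the 4/7, 3/7, 6/7 fractions above are its fixed
point, and the starting size N0 is arbitrary:

```python
def schedule(N0=1024, M=6):
    """Yield (m, N_m, B_m, L_m, D_m), assuming equality in Eqs 2 and 3."""
    N = N0
    D = 6.0 / 7.0 * (N0 + 1)      # steady-state D_0 for this scheme
    for m in range(M):
        B = D / 2.0               # Eq 3: D_m = 2*B_m
        L = N - B + 1             # Eq 2: N_m = L_m + B_m - 1
        yield m, N, B, L, D
        D = D + L                 # Eq 1: D_(m+1) = D_m + L_m
        if m % 2 == 1:            # two segments share each FFT size,
            N *= 2                # then N_(m+2) = 2 * N_m

for m, N, B, L, D in schedule():
    print(f"m={m}:  N={N}  B={B:.1f}  L={L:.1f}  D={D:.1f}")
```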
On 5/15/15 8:13 PM, Les Cargill wrote:
> robert bristow-johnson wrote:
...
>>
>> this paper has, as best that i can tell, the least amount of "latency"
>> between the time it was first presented at an AES Convention (Nov 1994)
>> and when published in the Journal (Mar 1995). like 4 friggin' months!
>
> FLAM! ;)
>
>> that's pretty impressive. i've never seen nor heard of any other
>> convention paper go from initial presentation to journal publication in
>> that short period of time.
>>
>
> It's one of the coolest algorithms I have ever read.
>
well, LakeDSP (from down-under) patented it, but that patent i think has
expired by now. they beat Bill to the patent office but Bill first
published it.

there are at least two persons other than either Bill or any of the
Aussies that told me of the same idea. it's one thing to conceive it,
it's another thing to make it work. but, in Bill's paper, i dunno how i
would do it. i don't see how the first buffer of samples can be
processed in time for them to be overlapped and added into the final
output.

it's because i can't see a good way to do it other than double-buffering.
single-buffering requires that the output be ready instantly after the
buffer of input samples is formed. like infinitely fast DSP bandwidth.

i know LakeDSP had a product with 56Ks. so they did it. i don't know
how they dealt with the details, though.

--
r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
robert bristow-johnson wrote:
> On 5/15/15 8:13 PM, Les Cargill wrote:
>> robert bristow-johnson wrote:
> ...
>>>
>>> this paper has, as best that i can tell, the least amount of "latency"
>>> between the time it was first presented at an AES Convention (Nov 1994)
>>> and when published in the Journal (Mar 1995). like 4 friggin' months!
>>
>> FLAM! ;)
>>
>>> that's pretty impressive. i've never seen nor heard of any other
>>> convention paper go from initial presentation to journal publication in
>>> that short period of time.
>>>
>>
>> It's one of the coolest algorithms I have ever read.
>>
>
> well, LakeDSP (from down-under) patented it, but that patent i think has
> expired by now. they beat Bill to the patent office but Bill first
> published it.
>
> there are at least two persons other than either Bill or any of the
> Aussies that told me of the same idea. it's one thing to conceive it,
> it's another thing to make it work. but, in Bill's paper, i dunno how i
> would do it. i don't see how the first buffer of samples can be
> processed in time for them to be overlapped and added into the final
> output.
For a convolution reverb, just sacrifice the first block - pass
it through as zeros or as the original samples. For most DAWs
that should be less than a millisecond's loss. Almost certainly < 10 ms.

Either that or you use brute force, which is good enough for small
vectors.
> it's because i can't see a good way to do it other than
> double-buffering. single-buffering requires that the output be ready
> instantly after the buffer of input samples is formed. like infinitely
> fast DSP bandwidth.
>
So you have a rotating/ring buffer... maybe it's two buffers long.

I am biased by the VST plugin interface, in which you get a sample and
return a sample.

You'd still need an intermediate buffer as an accumulator.

When it's time t for sample s[t], you output it.
> i know LakeDSP had a product with 56Ks. so they did it. i don't know
> how they dealt with the details, though.
>
--
Les Cargill
On 5/16/15 8:35 PM, Les Cargill wrote:
> robert bristow-johnson wrote:
>> On 5/15/15 8:13 PM, Les Cargill wrote:
>>> robert bristow-johnson wrote:
>> ...
>>>>
>>>> this paper has, as best that i can tell, the least amount of "latency"
>>>> between the time it was first presented at an AES Convention (Nov 1994)
>>>> and when published in the Journal (Mar 1995). like 4 friggin' months!
>>>
>>> FLAM! ;)
>>>
>>>> that's pretty impressive. i've never seen nor heard of any other
>>>> convention paper go from initial presentation to journal publication in
>>>> that short period of time.
>>>>
>>>
>>> It's one of the coolest algorithms I have ever read.
>>>
>>
>> well, LakeDSP (from down-under) patented it, but that patent i think has
>> expired by now. they beat Bill to the patent office but Bill first
>> published it.
>>
>> there are at least two persons other than either Bill or any of the
>> Aussies that told me of the same idea. it's one thing to conceive it,
>> it's another thing to make it work. but, in Bill's paper, i dunno how i
>> would do it. i don't see how the first buffer of samples can be
>> processed in time for them to be overlapped and added into the final
>> output.
>
> For a convolution reverb, just sacrifice the first block - pass
> it through as zeros or as the original samples. For most DAWs
> that should be less than a millisecond's loss. Almost certainly < 10 ms.
>
this is not a problem of initialization. it's a problem of keeping the
pipelines full of samples in a real-time process.

the whole issue is what time is necessary to first guarantee that *all*
of the input samples of the block are ready (before you pass them to the
FFT) and then to guarantee, with the fewest of assumptions, that the
output samples of the block are ready to add them up with the output
samples of the other FIR segments.

first you have to wait until all B samples are input before you can say
your input block is good. that's one block delay.

then, the least assumption is that your fast-convolution (FFT, H[k]*X[k],
iFFT) must get done before the *next* block of input is coming in.
that's only assuming real-time operation, that your CPU will be fast
enough for that, and *nothing* else. you cannot count on any of the
output samples being good until the end of that block period, *then* you
can start adding them in. that's the second block delay, another B
samples of delay. *then* you're ready to add those samples in with those
coming from other FIR segments.

that's double buffering. if you had an infinitely fast computer, you
could do this with a single block delay. if you make some *valid*
assumptions about how fast you can expect your FIR segment convolution to
get done processing your block, you can get a delay for that block that
is somewhere in between B and 2B.

but i wouldn't make that kind of assumption in a multi-threaded
environment, and i wouldn't likely do this low-latency fast convolution
(a la Gardner) in anything other than a multi-threaded environment.
*then* the only assumption i can make is that the fast-convolver for a
particular FIR segment will get its job done in time before the next
block is input for it and after its own block is input. that's double
buffering and the minimum delay is 2B samples. so that portion or
partition or segment of the FIR (whatever we call it) *must* be delayed
by at least 2B samples.
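just to put that bookkeeping in executable form (a toy check, assuming
Python; the block length B is illustrative):

```python
B = 32   # block length (illustrative)

def output_ready_time(n):
    """Earliest time the output for input sample n can be relied on.
    Input block k = n // B is complete at time (k+1)*B; under the minimal
    real-time assumption, its fast convolution is only guaranteed done at
    the end of the *next* block period."""
    k = n // B
    return (k + 2) * B

# the first sample of every block sees exactly 2*B samples of delay:
assert all(output_ready_time(k * B) - k * B == 2 * B for k in range(100))
```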
> Either that or you use brute force, which is good enough for small
> vectors.
by that, i presume you mean the non-FFT, regular non-fast convolution method.
>> it's because i can't see a good way to do it other than
>> double-buffering. single-buffering requires that the output be ready
>> instantly after the buffer of input samples is formed. like infinitely
>> fast DSP bandwidth.
>>
>
> So you have a rotating/ring buffer... maybe it's two buffers long.
yes, it is. and you ping-pong to the two buffers, which is the same as using one half of the ring buffer while the other half is being I/O.
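that equivalence, as a tiny sketch (assuming Python; B and the indexing
are illustrative):

```python
B = 32
ring = [0.0] * (2 * B)    # one ring buffer, two blocks long

def half(k):
    """Index range of the half of the ring that block k lives in."""
    start = (k % 2) * B   # even blocks use [0, B), odd blocks use [B, 2B)
    return start, start + B

# while block k is being processed out of one half, the I/O side fills
# the other half with block k+1 -- ping-ponging two buffers by another name:
for k in range(8):
    assert half(k) != half(k + 1)
```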
>
> I am biased by the VST plugin interface, in which you get a sample and
> return a sample.
now, the only thing that *could* be hardware specific is that a sample
comes in from the A/D simultaneously when another sample goes out to the
D/A.

in a real-time environment, and assuming *only* that your CPU is fast
enough to process and return that output sample by the end of the
sampling period, how much is the latency? it must be 2 samples. the
sample output to the D/A is the sample you returned in the previous
sample time. and the input sample you are presented with at the
beginning of your sampling period was obtained from the previous sample
time.

in real-time, if at the beginning of your sampling period, you are
guaranteed the correctness of your input sample, and if you do *not* have
to guarantee the correctness of your output sample until the end of your
sampling period, you have two samples delay. even if the algorithm is a
wire. now the output of a wire can be computed right away at the
beginning of the sampling period, but the system does not *assume* that
that output is ready until the end of the sampling period.
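the wire case, spelled out as a toy loop (assuming Python; the held_*
names are just illustrative):

```python
def run(x):
    """Even when the 'algorithm' is a wire, the I/O discipline alone
    produces two samples of delay."""
    out = []
    held_output = 0.0            # what goes to the D/A this period
    held_input = 0.0             # what the A/D captured last period
    for sample in x:
        out.append(held_output)  # D/A gets last period's returned sample
        held_output = held_input # the "wire": return the presented input
        held_input = sample      # A/D sample becomes next period's input
    return out

assert run([1, 2, 3, 4, 5]) == [0.0, 0.0, 1, 2, 3]   # delay of 2 samples
```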
> You'd still need an intermediate buffer as an accumulator.
>
> When it's time t for sample s[t], you output it.
i understand all that. even though i have not (yet) done a convolutional
reverb a la Gardner, i *have* done a couple of fast-convolution
implementations since grad school.

--
r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
robert bristow-johnson wrote:
> On 5/16/15 8:35 PM, Les Cargill wrote:
>> robert bristow-johnson wrote:
>>> On 5/15/15 8:13 PM, Les Cargill wrote:
>>>> robert bristow-johnson wrote:
>>> ...
>>>>>
>>>>> this paper has, as best that i can tell, the least amount of "latency"
>>>>> between the time it was first presented at an AES Convention (Nov 1994)
>>>>> and when published in the Journal (Mar 1995). like 4 friggin' months!
>>>>
>>>> FLAM! ;)
>>>>
>>>>> that's pretty impressive. i've never seen nor heard of any other
>>>>> convention paper go from initial presentation to journal publication in
>>>>> that short period of time.
>>>>>
>>>>
>>>> It's one of the coolest algorithms I have ever read.
>>>>
>>>
>>> well, LakeDSP (from down-under) patented it, but that patent i think has
>>> expired by now. they beat Bill to the patent office but Bill first
>>> published it.
>>>
>>> there are at least two persons other than either Bill or any of the
>>> Aussies that told me of the same idea. it's one thing to conceive it,
>>> it's another thing to make it work. but, in Bill's paper, i dunno how i
>>> would do it. i don't see how the first buffer of samples can be
>>> processed in time for them to be overlapped and added into the final
>>> output.
>>
>> For a convolution reverb, just sacrifice the first block - pass
>> it through as zeros or as the original samples. For most DAWs
>> that should be less than a millisecond's loss. Almost certainly < 10 ms.
>>
>
> this is not a problem of initialization. it's a problem of keeping the
> pipelines full of samples in a real-time process.
>
> the whole issue is what time is necessary to first guarantee that *all*
> of the input samples of the block are ready (before you pass them to the
> FFT) and then to guarantee, with the fewest of assumptions, that the
> output samples of the block are ready to add them up with the output
> samples of the other FIR segments.
>
> first you have to wait until all B samples are input before you can say
> your input block is good. that's one block delay.
>
> then, the least assumption is that your fast-convolution (FFT,
> H[k]*X[k], iFFT) must get done before the *next* block of input is
> coming in. that's only assuming real-time operation, that your CPU will
> be fast enough for that, and *nothing* else. you cannot count on any of
> the output samples being good until the end of that block period, *then*
> you can start adding them in. that's the second block delay, another B
> samples of delay. *then* you're ready to add those samples in with
> those coming from other FIR segments.
>
> that's double buffering. if you had an infinitely fast computer, you
> could do this with a single block delay. if you make some *valid*
> assumptions about how fast you can expect your FIR segment convolution
> to get done processing your block, you can get a delay for that block
> that is somewhere in between B and 2B.
>
> but i wouldn't make that kind of assumption in a multi-threaded
> environment, and i wouldn't likely do this low-latency fast convolution
> (a la Gardner) in anything other than a multi-threaded environment.
> *then* the only assumption i can make is that the fast-convolver for a
> particular FIR segment will get its job done in time before the next
> block is input for it and after its own block is input. that's double
> buffering and the minimum delay is 2B samples. so that portion or
> partition or segment of the FIR (whatever we call it) *must* be delayed
> by at least 2B samples.
>
That could well be. I'd almost have to go through the exercise to know
for sure. I am thinking that it is more like 1B worth of samples at
first are simply lost. You have at least had the opportunity to do all
processing, in terms of data availability, by the time the second block
is received. And I am totally waving hands at the actual hardware
interface.
>> Either that or you use brute force, which is good enough for small
>> vectors.
>
> by that, i presume you mean the non-FFT, regular non-fast convolution
> method.
>
>>> it's because i can't see a good way to do it other than
>>> double-buffering. single-buffering requires that the output be ready
>>> instantly after the buffer of input samples is formed. like infinitely
>>> fast DSP bandwidth.
>>>
>>
>> So you have a rotating/ring buffer... maybe it's two buffers long.
>
> yes, it is. and you ping-pong to the two buffers, which is the same as
> using one half of the ring buffer while the other half is being I/O.
>
>>
>> I am biased by the VST plugin interface, in which you get a sample and
>> return a sample.
>
> now, the only thing that *could* be hardware specific is that a sample
> comes in from the A/D simultaneously when another sample goes out to the
> D/A. in a real-time environment, and assuming *only* that your CPU is
> fast enough to process and return that output sample by the end of the
> sampling period, how much is the latency? it must be 2 samples.
For the aforementioned VST interface, it's more than that. I think
the smallest my ASIO over USB interface goes to is 32 samples.

I don't think you have a 2B problem in addition to that, though. At
least for Windows, the plugin threads run on whatever it is they call
the "multimedia" timer-driven interface these days - it's not that far
from real time. Obviously, this can break down.
> the sample output to the D/A is the sample you returned in the previous
> sample time. and the input sample you are presented with at the
> beginning of your sampling period was obtained from the previous sample
> time.
>
> in real-time, if at the beginning of your sampling period, you are
> guaranteed the correctness of your input sample, and if you do *not*
> have to guarantee the correctness of your output sample until the end of
> your sampling period, you have two samples delay. even if the algorithm
> is a wire. now the output of a wire can be computed right away at the
> beginning of the sampling period, but the system does not *assume* that
> that output is ready until the end of the sampling period.
>
>> You'd still need an intermediate buffer as an accumulator.
>>
>> When it's time t for sample s[t], you output it.
>
> i understand all that. even though i have not (yet) done a
> convolutional reverb a la Gardner, i *have* done a couple of
> fast-convolution implementations since grad school.
>
I've only done a couple, but they weren't real-time.

--
Les Cargill
On 5/18/15 8:04 PM, Les Cargill wrote:
>
...
>
> For the aforementioned VST interface, it's more than that. I think
> the smallest my ASIO over USB interface goes to is 32 samples.
>
> I don't think you have a 2B problem in addition to that, though.
>
*if* samples are being processed in 32-sample blocks, which means you
have a block of time equal to 32 sampling periods to process those 32
samples...

*if* all 32 input samples are guaranteed to be valid at the beginning of
your block of time,

and *if* you need not guarantee the validity of any of the 32 output
samples until the very end of your block of time,

*then* your processing delay must be at least 64 samples.

L8r,

--
r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
robert bristow-johnson  <rbj@audioimagination.com> wrote:

>On 5/18/15 8:04 PM, Les Cargill wrote:
>> For the aforementioned VST interface, it's more than that. I think
>> the smallest my ASIO over USB interface goes to is 32 samples.
>> I don't think you have a 2B problem in addition to that, though.
>*if* samples are being processed in 32-sample blocks, which means you
>have a block of time equal to 32 sampling periods to process those 32
>samples...
>
>*if* all 32 input samples are guaranteed to be valid at the beginning of
>your block of time,
>
>and *if* you need not guarantee the validity of any of the 32 output
>samples until the very end of your block of time,
>
>*then* your processing delay must be at least 64 samples.
I would say the latency is (32 + runtime) samples, where "runtime" is
the amount of time required to compute 32 output samples, and
conceivably could be quite small (e.g. 1 or 2 samples).

Steve
On 5/19/15 11:36 AM, Steve Pope wrote:
> robert bristow-johnson<rbj@audioimagination.com> wrote:
>
>> On 5/18/15 8:04 PM, Les Cargill wrote:
>
>>> For the aforementioned VST interface, it's more than that. I think
>>> the smallest my ASIO over USB interface goes to is 32 samples.
>
>>> I don't think you have a 2B problem in addition to that, though.
>
>> *if* samples are being processed in 32-sample blocks, which means you
>> have a block of time equal to 32 sampling periods to process those 32
>> samples...
>>
>> *if* all 32 input samples are guaranteed to be valid at the beginning of
>> your block of time,
>>
>> and *if* you need not guarantee the validity of any of the 32 output
>> samples until the very end of your block of time,
>>
>> *then* your processing delay must be at least 64 samples.
>
> I would say the latency is (32 + runtime) samples, where "runtime"
> is the amount of time required to compute 32 output samples, and
> conceivably could be quite small (e.g. 1 or 2 samples).
>
and then in that case, your DSP or CPU or whatever is jerking off for
most of the remaining 30 or 29 samples in the block of time for 32
samples.

and if you were to *require* a latency of 32 + 1 or 2 samples, then you
do not have the axiom above that you need not guarantee the validity of
any of the 32 samples until the very end of your block of time.

what if you wanted to pack in as much processing as possible in your
real-time DSP processing application?

--
r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
robert bristow-johnson  <rbj@audioimagination.com> wrote:

>On 5/19/15 11:36 AM, Steve Pope wrote:
>> I would say the latency is (32 + runtime) samples, where "runtime"
>> is the amount of time required to compute 32 output samples, and
>> conceivably could be quite small (e.g. 1 or 2 samples).
>and then in that case, your DSP or CPU or whatever is jerking off for
>most of the remaining 30 or 29 samples in the block of time for 32
>samples.
Right. I think the usual engineering terminology is more like the DSP/CPU is "underutilized", or "has a low duty cycle"...
>and if you were to *require* a latency of 32 + 1 or 2 samples,
>then you do not have the axiom above that you need not guarantee the
>validity of any of the 32 samples until the very end of your block of time.
If it's okay to sometimes emit invalid data, perhaps flagging it as
such, you can conceivably save on hardware. This was commonly done in
old CD players, since the Red Book specs (BLER etc.) were quite liberal
and ultimately there is an error-masking feature applied just before the
DACs.
>what if you wanted to pack in as much processing as possible in your
>real-time DSP processing application?
Underutilized hardware is a problem, sometimes. In other scenarios the
power savings from running hardware at a low duty cycle is more
important than the silicon area. It's scenario-dependent.

Steve
robert bristow-johnson wrote:
> On 5/18/15 8:04 PM, Les Cargill wrote:
>>
> ...
>>
>> For the aforementioned VST interface, it's more than that. I think
>> the smallest my ASIO over USB interface goes to is 32 samples.
>>
>> I don't think you have a 2B problem in addition to that, though.
>>
>
> *if* samples are being processed in 32-sample blocks, which means you
> have a block of time equal to 32 sampling periods to process those 32
> samples...
>
You may need less... but that fact may or may not be part of the delay calculation.
> *if* all 32 input samples are guaranteed to be valid at the beginning of
> your block of time,
>
> and *if* you need not guarantee the validity of any of the 32 output
> samples until the very end of your block of time,
>
> *then* your processing delay must be at least 64 samples.
>
> L8r,
>
That seems perfectly reasonable.

The Gardner paper says you can trade delay for reduced CPU cost. It
also says a filter of 128k samples will cost ~427 multiplies, from
34 * log2(n) - 151.

He seems to be pretty firm about it being *ZERO* delay, although you
almost get the feeling it's as if it were a proper name.

--
Les Cargill
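A quick arithmetic check of that quoted figure, assuming Python (with
n = 128k, log2(n) = 17 exactly):

```python
# 34 * log2(n) - 151 evaluated at n = 128k = 131072
from math import log2
print(34 * log2(128 * 1024) - 151)   # -> 427.0, i.e. 34*17 - 151
```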