
Time Delay of a FIR filter

Started by adnanyusuf May 14, 2015
On 5/19/15 5:21 PM, Steve Pope wrote:
> robert bristow-johnson <rbj@audioimagination.com> wrote:
>
>> On 5/19/15 11:36 AM, Steve Pope wrote:
>
>>> I would say the latency is (32 + (runtime)) samples, where "runtime"
>>> is the amount of time required to compute 32 output samples, and
>>> conceivably could be quite small (e.g. 1 or 2 samples).
>
>> and then in that case, your DSP or CPU or whatever is jerking off for
>> most of the remaining 30 or 29 samples in the block of time for 32
>> samples.
>
> Right.  I think the usual engineering terminology is more like the
> DSP/CPU is "underutilized", or "has a low duty cycle"...
well, you know what they say: "hair grows on the palms of idle hands."
>> and if you were to *require* a latency of 32 + 1 or 2 samples,
>> then you do not have the axiom above that you need not guarantee the
>> validity of any of the 32 samples until the very end of your block of time.
>
> If it's okay to sometimes emit invalid data, perhaps flagging
> it as such, you can conceivably save on hardware.
it certainly isn't simple.  to me this fast convolution thingie does this in a block time:

  1. FFT the input data.  (must wait until all 32 samples are input.)
  2. multiply H[k] x X[k]
  3. iFFT the output.  (you can't really get at the earliest samples in
     your block until the iFFT is done.)

it should be simple: you have 32/Fs seconds to turn this buffer or vector of 32 input samples into another chunk of samples.  you need all 32 input samples to be good before you can begin, and at the end, with the iFFT, you'll not likely have the last sample known before the first.
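[a minimal sketch of one such block in Python with numpy, overlap-save
form assumed; the 32-sample hop, the FFT size of 64, and the name
process_block are illustrative, not from this thread:]

import numpy as np

def process_block(x_block, tail, H, N=64):
    # one fast-convolution block, overlap-save style (illustrative):
    #   x_block : the 32 new input samples (all must be valid first)
    #   tail    : the previous 32 input samples (kernel length L <= 33)
    #   H       : np.fft.rfft(h, N) of the FIR kernel, precomputed once
    seg = np.concatenate([tail, x_block])   # length N = 64
    X = np.fft.rfft(seg)                    # 1. FFT the input data
    Y = H * X                               # 2. multiply H[k] x X[k]
    y = np.fft.irfft(Y, N)                  # 3. iFFT the output
    # circular wrap-around corrupts the first L-1 output samples; the
    # last 32 are valid, and none is known before the iFFT completes
    return y[len(tail):], seg[-len(tail):]  # (32 outputs, next tail)

[here H would be built once as np.fft.rfft(h, 64) and tail would start
as 32 zeros.]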
> This was commonly done in old CD players since the Red Book specs
> (BLER etc.) were quite liberal, and ultimately there is an
> error-masking feature applied just before the DACs.
it might be dither. might be noise shaping. could even be both.
>> what if you wanted to pack in as much processing as possible in your
>> real-time DSP processing application?
>
> Underutilized hardware is a problem, sometimes.  In other
> scenarios the power savings from running hardware at a low
> duty cycle are more important than the silicon area.
>
> It's scenario-dependent.
okay, but the motivation behind my question was regarding this multi-segmented FIR convolution.  if i had a grossly "underutilized" DSP that is jerking off for 90% of the time, i would wanna put that lollygagger to useful work.  you get more "utilization" out of the DSP by increasing the length of the FIR filter kernel.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
On 17.05.15 01.26, robert bristow-johnson wrote:
>> Of course, it makes no sense to
>> have only a few samples more.  But 167% of the FIR length works quite
>> well, e.g. FIR kernel of 6144 and FFT of 8192.
>
> very inefficient.  you do *not* want the FIR kernel to be such a large
> portion of the FFT.
Why not?  OK, usually I choose 1:1, but the above is from an equalizer
plugin.  Although the latency is normally hidden by the player engine's
prefetch, large latencies tend to make seek operations sluggish.  With
the above configuration the response time to a seek is less than 50 ms,
including network latency for PulseAudio playback on a remote server.
Of course, the prefetch engine is still part of the solution.  Most of
the time I use twice as many samples (12k FIR / 16k FFT) for higher
accuracy and group-delay equalization.
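[for concreteness, the arithmetic behind this disagreement, overlap-save
assumed: with N = 8192 and L = 6144, each block yields only
N - L + 1 = 2049 fresh output samples, so roughly four blocks (each a
forward and an inverse FFT) are spent per 8192 input samples; at
L = N/2 = 4096 the same FFT size would yield 4097 samples per block,
about twice the output per FFT.]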
> these are the facts i am dealing with:
>
> the FIR is partitioned into M segments:
>
>          M-1
>   h[n] = SUM h_m[n-D_m]
>          m=0
[...]
> now the only advantage you get from this is that you have a more
> efficient FFT, especially for the (m+1)-th segment because the length
> of the FIR kernel is much smaller: 2/7 of the FFT size, but it's better
> for both segments.
>
> those are the quantitative facts i am dealing with, Marcel.  having
>
>    L_m = 3/4 N_m
>
> is just not efficient.
Sorry, I didn't catch what you intend to tell me here.  Pretty sure
there are more calculations needed when the overlap window is large,
simply because more FFTs have to be computed in the same time.

Choosing a larger FFT has, in theory, logarithmic complexity.  But in
practice it is worse, because most FFTs are memory bound nowadays (even
on a Raspberry Pi), and cache efficiency degrades sharply with the
working-set size as soon as you leave the L1 cache, especially given
the scattered memory access of the FFT algorithm.

The above, "inefficient" algorithm even works in real time on a 450 MHz
AMD from the 90s (48 kHz, 2 channels), although prefetching causes peaks
of 100% CPU load on this device.  Choosing larger FFTs does not really
improve the situation, except that the peaks from prefetching take
significantly longer.
> i can send you (or anyone else) two short little pdf files of the above
> analysis and of why we want the FIR kernel length to be about 1/5 or 1/6
> of the FFT size (*not* 3/4 of the FFT).  but you need to send me a
> de-spammed return address.
I prefer practical tests over theory.  And the number of FLOPs is by far
not the only parameter.


Marcel
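[a small numerical check, in Python with numpy, of the partitioning
identity h[n] = SUM h_m[n-D_m] quoted above; the segment edges here are
hypothetical, chosen only to exercise the identity:]

import numpy as np

rng = np.random.default_rng(0)
h = rng.standard_normal(6144)             # full FIR kernel (length from thread)
x = rng.standard_normal(20000)            # arbitrary test signal

# partition h into contiguous segments; D_m is the offset of segment m,
# so h[n] = SUM_m h_m[n - D_m] holds exactly (the segments tile the kernel)
edges = [0, 1024, 2048, 6144]             # hypothetical segment boundaries
y = np.zeros(len(x) + len(h) - 1)
for D, D_next in zip(edges[:-1], edges[1:]):
    h_m = h[D:D_next]                     # segment m
    y_m = np.convolve(x, h_m)             # stands in for an N_m-point FFT
    y[D:D + len(y_m)] += y_m              # delay by D_m, then sum

assert np.allclose(y, np.convolve(x, h))  # matches the unpartitioned FIR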
On 5/20/15 1:37 AM, Marcel Mueller wrote:
> On 17.05.15 01.26, robert bristow-johnson wrote:
>>> Of course, it makes no sense to
>>> have only a few samples more.  But 167% of the FIR length works quite
>>> well, e.g. FIR kernel of 6144 and FFT of 8192.
>>
>> very inefficient.  you do *not* want the FIR kernel to be such a large
>> portion of the FFT.
>
> Why not?
...
>> i can send you (or anyone else) two short little pdf files of the above
>> analysis and of why we want the FIR kernel length to be about 1/5 or 1/6
>> of the FFT size (*not* 3/4 of the FFT).  but you need to send me a
>> de-spammed return address.
>
> I prefer practical tests over theory.  And the number of FLOPs is by far
> not the only parameter.
but you can still compare apples to apples.  the FFT costs (per block)
something like:

   Cost = C1*N*log2(N) + C2*N + C3*log2(N) + C4

for some constants C1, C2, C3, C4 to be determined from analysis or
profiling of the code performing the FFT and other tasks of the circular
convolution.  C1 is the cost of performing an FFT butterfly.  C2 is the
cost of all operations performed on each FFT bin, such as multiplication
of H[k] by X[k].  C3 is the overhead cost of setting up each FFT pass,
and C4 is the total constant overhead of setting up each block or frame.
it is likely that C3 and C4 can be ignored since they multiply much
smaller numbers than C1 and C2.

the payoff is processing N - L + 1 samples, where L is the length of the
FIR kernel.  so the cost per sample to process one block is

   (C1*N*log2(N) + C2*N + C3*log2(N) + C4) / (N-L+1)

given a length of FIR L, you want to minimize the above expression.  try
setting C3 and C4 to zero and C2 approximately equal to C1, and find
what power of 2 for N makes the above expression minimum.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
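[that expression is easy to evaluate numerically; a Python sketch with
the toy constants suggested above (C1 = C2 = 1, C3 = C4 = 0):]

import numpy as np

def cost_per_sample(N, L, C1=1.0, C2=1.0, C3=0.0, C4=0.0):
    # the cost model from the post: FFT butterflies plus per-bin work,
    # divided by the N - L + 1 useful output samples per block
    return (C1*N*np.log2(N) + C2*N + C3*np.log2(N) + C4) / (N - L + 1)

L = 6144                                  # the kernel length from the thread
for N in [8192, 16384, 32768, 65536, 131072]:
    print(N, cost_per_sample(N, L))
# with these toy constants the minimum falls near N = 65536, i.e. the
# kernel is roughly 1/10 of the FFT; the true optimum depends on the
# profiled constants C1..C4, which is the point of the exercise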