DSPRelated.com
Forums

Is there a better way to optimize the implementation of IIR filter?

Started by jogging August 4, 2011
Hi, all
     Recently I work on speedup of the implementation
 of IIR filter. Previouly my work focuses on video codec.
I have litter experience on speech processing.
     In the processing, a IIR filter is used.
The code looks like the following code::
            /* Implement the full filter */
            while( (Nx--) > 0 )
            {
                z0 = (*x) - a1 * z1 - a2 * z2;
                *(x++) = b0 * z0 + b1 * z1 + b2 * z2;
                z2 = z1;
                z1 = z0;
            }

The data dependency makes it difficult to
optimize it using SIMD instructions.
Is there a better way to optimize the implementation of IIR filter?

Best Regards
Jogging

jogging wrote:

> Hi, all > Recently I work on speedup of the implementation > of IIR filter. Previouly my work focuses on video codec. > I have litter experience on speech processing.
Stupident.
> In the processing, a IIR filter is used. > The code looks like the following code:: > /* Implement the full filter */ > while( (Nx--) > 0 ) > { > z0 = (*x) - a1 * z1 - a2 * z2; > *(x++) = b0 * z0 + b1 * z1 + b2 * z2; > z2 = z1; > z1 = z0; > } > > The data dependency makes it difficult to > optimize it using SIMD instructions. > Is there a better way to optimize the implementation of IIR filter?
Compute all of the a1*z[i]+a2[z+1]+b1*z[i+2]+b2*z[i+3] in the first pass. Do the data dependent part in the second pass. VLV
On 8/4/11 12:03 PM, jogging wrote:
> Hi, all > Recently I work on speedup of the implementation > of IIR filter. Previouly my work focuses on video codec. > I have litter experience on speech processing. > In the processing, a IIR filter is used. > The code looks like the following code:: > /* Implement the full filter */ > while( (Nx--)> 0 ) > { > z0 = (*x) - a1*z1 - a2*z2; > *(x++) = b0*z0 + b1*z1 + b2*z2; > z2 = z1; > z1 = z0; > }
this is pretty clearly the Direct Form II (DF2) biquad. you better be doing this in floating point, otherwise if your biquad is decently resonant, you will have internal clipping problems at the z0 node. if you're doing this in fixed point, you should use the Direct Form I (DF1).
> The data dependency makes it difficult to > optimize it using SIMD instructions.
i don't get that, but i also don't get Vlad's response to it. i also don't see how coding SIMD in C works. what are you using? some SHArC DSP?
> Is there a better way to optimize the implementation of IIR filter?
in straight C code, this is what lives in some old file of mine: typedef struct { float Fs; enum EQ_type type; float f0; float dBgain; float Q; float filterCoefficients[5]; // b0, b1, b2, a1, a2 (a0 = 1) float graphCoefficients[6]; // B2, B1, B0, A2, A1, A0 } sectionData; void filterBuffer( int Nsamples, int Nsections, float *input, float *output, float *filterStates, sectionData **sections ) { register int n, i; register sectionData **section_ptr; register float sampleValue, state1, state2, *state_ptr, *coef_ptr; for (n=0; n<Nsamples; n++) { section_ptr = sections; // reset section pointer state_ptr = filterStates; // reset state pointer sampleValue = input[n]; // get input sample state2 = *state_ptr++; // x[n-2] state1 = *state_ptr--; // x[n-1] *state_ptr++ = state1; // x[n-1] -> x[n-2] *state_ptr++ = sampleValue; // x[n] -> x[n-1] i = Nsections; while (--i >= 0) { coef_ptr = &((*section_ptr++)->filterCoefficients[0]); sampleValue = *coef_ptr++ * sampleValue; sampleValue += *coef_ptr++ * state1; sampleValue += *coef_ptr++ * state2; state2 = *state_ptr++; state1 = *state_ptr--; *state_ptr++ = state1; sampleValue -= *coef_ptr++ * state1; sampleValue -= *coef_ptr++ * state2; *state_ptr++ = sampleValue; } output[n] = sampleValue; // put output sample } } dunno how optimum it is. speed and clarity was my motivation. -- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
I assume all your variables are floats because otherwise your code does not
work. Therefore I also assume your code runs on a PC.
If your biquad routine must be generic (that is, you don't know in advance
anything about the filter type) there aren't that much you can do in C I
think, maybe except from loop unrolling. If your target is not a PC it is a
different story. 

>On 8/4/11 12:03 PM, jogging wrote: >> Hi, all >> Recently I work on speedup of the implementation >> of IIR filter. Previouly my work focuses on video codec. >> I have litter experience on speech processing. >> In the processing, a IIR filter is used. >> The code looks like the following code:: >> /* Implement the full filter */ >> while( (Nx--)> 0 ) >> { >> z0 = (*x) - a1*z1 - a2*z2; >> *(x++) = b0*z0 + b1*z1 + b2*z2; >> z2 = z1; >> z1 = z0; >> } > >this is pretty clearly the Direct Form II (DF2) biquad. you better be >doing this in floating point, otherwise if your biquad is decently >resonant, you will have internal clipping problems at the z0 node. > >if you're doing this in fixed point, you should use the Direct Form I
(DF1).
> > >> The data dependency makes it difficult to >> optimize it using SIMD instructions. > >i don't get that, but i also don't get Vlad's response to it. i also >don't see how coding SIMD in C works. what are you using? some SHArC
DSP?
> >> Is there a better way to optimize the implementation of IIR filter? > >in straight C code, this is what lives in some old file of mine: > > > > > > >typedef struct > { > float Fs; > enum EQ_type type; > float f0; > float dBgain; > float Q; > float filterCoefficients[5]; // b0, b1, b2, a1, a2 (a0 = 1) > float graphCoefficients[6]; // B2, B1, B0, A2, A1, A0 > } sectionData; > > >void filterBuffer( int Nsamples, > int Nsections, > float *input, > float *output, > float *filterStates, > sectionData **sections ) > { > register int n, i; > register sectionData **section_ptr; > register float sampleValue, state1, state2, *state_ptr, *coef_ptr; > > for (n=0; n<Nsamples; n++) > { > section_ptr = sections; // reset section pointer > state_ptr = filterStates; // reset state pointer > > sampleValue = input[n]; // get input sample > > state2 = *state_ptr++; // x[n-2] > state1 = *state_ptr--; // x[n-1] > *state_ptr++ = state1; // x[n-1] -> x[n-2] > *state_ptr++ = sampleValue; // x[n] -> x[n-1] > > i = Nsections; > while (--i >= 0) > { > coef_ptr = &((*section_ptr++)->filterCoefficients[0]); > > sampleValue = *coef_ptr++ * sampleValue; > sampleValue += *coef_ptr++ * state1; > sampleValue += *coef_ptr++ * state2; > > state2 = *state_ptr++; > state1 = *state_ptr--; > *state_ptr++ = state1; > > sampleValue -= *coef_ptr++ * state1; > sampleValue -= *coef_ptr++ * state2; > > *state_ptr++ = sampleValue; > } > > output[n] = sampleValue; // put output sample > } > } > > >dunno how optimum it is. speed and clarity was my motivation. > > > >-- > >r b-j rbj@audioimagination.com > >"Imagination is more important than knowledge." > > >
Yes, you are corrent. The calculation uses floating point. I try to optimize the code on ARM A8 using NEON(like SSE on x86). To take advantage of the NEON, parallel calculation is expected, so SIMD(single instruction multi data) can be used. Best Regards Jogging
> > >jogging wrote: > >> Hi, all >> Recently I work on speedup of the implementation >> of IIR filter. Previouly my work focuses on video codec. >> I have litter experience on speech processing. > >Stupident. > > >> In the processing, a IIR filter is used. >> The code looks like the following code:: >> /* Implement the full filter */ >> while( (Nx--) > 0 ) >> { >> z0 = (*x) - a1 * z1 - a2 * z2; >> *(x++) = b0 * z0 + b1 * z1 + b2 * z2; >> z2 = z1; >> z1 = z0; >> } >> >> The data dependency makes it difficult to >> optimize it using SIMD instructions. >> Is there a better way to optimize the implementation of IIR filter? > >Compute all of the a1*z[i]+a2[z+1]+b1*z[i+2]+b2*z[i+3] in the first >pass. Do the data dependent part in the second pass. > >VLV >
Thanks, Vladimir Can you describe your idea in detail? I am not very familiar with filter implementation. Best Regards Jogging
On Mon, 08 Aug 2011 10:58:49 -0500, mountaineer wrote:


>> >>jogging wrote: >> >>> Hi, all >>> Recently I work on speedup of the implementation >>> of IIR filter. Previouly my work focuses on video codec. >>> I have litter experience on speech processing. >> >>Stupident. >> >> >>> In the processing, a IIR filter is used. >>> The code looks like the following code:: >>> /* Implement the full filter */ >>> while( (Nx--) > 0 ) >>> { >>> z0 = (*x) - a1 * z1 - a2 * z2; >>> *(x++) = b0 * z0 + b1 * z1 + b2 * z2; z2 = z1; >>> z1 = z0; >>> } >>> >>> The data dependency makes it difficult to optimize it using SIMD >>> instructions. >>> Is there a better way to optimize the implementation of IIR filter? >> >>Compute all of the a1*z[i]+a2[z+1]+b1*z[i+2]+b2*z[i+3] in the first >>pass. Do the data dependent part in the second pass. >> >>VLV >> >> > Thanks, Vladimir > > Can you describe your idea in detail? I am not very familiar with filter > implementation.
The operation: accumulator += a * b; can often be found as a single instruction on a number of different processors, under the name "MAC" (multiply and accumulate). If you have your data and coefficients arranged in vectors, then there's a good chance that you can do accumulator = 0; accumulator += a[0] * x[0]; accumulator += a[1] * x[1]; (etc.) quite fast. "Real" DSP chips will do a multiply and accumulate, increment the index registers, and check to see if the loop has finished, all in one clock cycles. Some general purpose processors come close. So see how much of your algorithm you can cast into the above form, or some other form that is more applicable to your processor. -- www.wescottdesign.com
>On Mon, 08 Aug 2011 10:58:49 -0500, mountaineer wrote: > > >>> >>>jogging wrote: >>> >>>> Hi, all >>>> Recently I work on speedup of the implementation >>>> of IIR filter. Previouly my work focuses on video codec. >>>> I have litter experience on speech processing. >>> >>>Stupident. >>> >>> >>>> In the processing, a IIR filter is used. >>>> The code looks like the following code:: >>>> /* Implement the full filter */ >>>> while( (Nx--) > 0 ) >>>> { >>>> z0 = (*x) - a1 * z1 - a2 * z2; >>>> *(x++) = b0 * z0 + b1 * z1 + b2 * z2; z2 = z1; >>>> z1 = z0; >>>> } >>>> >>>> The data dependency makes it difficult to optimize it using SIMD >>>> instructions. >>>> Is there a better way to optimize the implementation of IIR filter? >>> >>>Compute all of the a1*z[i]+a2[z+1]+b1*z[i+2]+b2*z[i+3] in the first >>>pass. Do the data dependent part in the second pass. >>> >>>VLV >>> >>> >> Thanks, Vladimir >> >> Can you describe your idea in detail? I am not very familiar with
filter
>> implementation. > >The operation: > >accumulator += a * b; > >can often be found as a single instruction on a number of different >processors, under the name "MAC" (multiply and accumulate). If you have >your data and coefficients arranged in vectors, then there's a good >chance that you can do > >accumulator = 0; >accumulator += a[0] * x[0]; >accumulator += a[1] * x[1]; >(etc.) > >quite fast. "Real" DSP chips will do a multiply and accumulate, >increment the index registers, and check to see if the loop has finished,
>all in one clock cycles. Some general purpose processors come close. > >So see how much of your algorithm you can cast into the above form, or >some other form that is more applicable to your processor. > >-- >www.wescottdesign.com >
Thanks, Tim. What you mean is that some operation can be performed in parallel when calculating one sample. I wonder whether there is a way to calculate several sample in parallel. Regards Jogging
On Tue, 09 Aug 2011 08:56:34 -0500, mountaineer wrote:

>>On Mon, 08 Aug 2011 10:58:49 -0500, mountaineer wrote: >> >> >> >>>>jogging wrote: >>>> >>>>> Hi, all >>>>> Recently I work on speedup of the implementation >>>>> of IIR filter. Previouly my work focuses on video codec. >>>>> I have litter experience on speech processing. >>>> >>>>Stupident. >>>> >>>> >>>>> In the processing, a IIR filter is used. >>>>> The code looks like the following code:: >>>>> /* Implement the full filter */ >>>>> while( (Nx--) > 0 ) >>>>> { >>>>> z0 = (*x) - a1 * z1 - a2 * z2; >>>>> *(x++) = b0 * z0 + b1 * z1 + b2 * z2; z2 = z1; z1 = >>>>> z0; >>>>> } >>>>> >>>>> The data dependency makes it difficult to optimize it using SIMD >>>>> instructions. >>>>> Is there a better way to optimize the implementation of IIR filter? >>>> >>>>Compute all of the a1*z[i]+a2[z+1]+b1*z[i+2]+b2*z[i+3] in the first >>>>pass. Do the data dependent part in the second pass. >>>> >>>>VLV >>>> >>>> >>> Thanks, Vladimir >>> >>> Can you describe your idea in detail? I am not very familiar with > filter >>> implementation. >> >>The operation: >> >>accumulator += a * b; >> >>can often be found as a single instruction on a number of different >>processors, under the name "MAC" (multiply and accumulate). If you have >>your data and coefficients arranged in vectors, then there's a good >>chance that you can do >> >>accumulator = 0; >>accumulator += a[0] * x[0]; >>accumulator += a[1] * x[1]; >>(etc.) >> >>quite fast. "Real" DSP chips will do a multiply and accumulate, >>increment the index registers, and check to see if the loop has >>finished, > >>all in one clock cycles. Some general purpose processors come close. >> >>So see how much of your algorithm you can cast into the above form, or >>some other form that is more applicable to your processor. >> >>-- >>www.wescottdesign.com >> > Thanks, Tim. > What you mean is that some operation can be performed in parallel when > calculating one sample. I wonder whether there is a way to calculate > several sample in parallel.
The only two ways I know to do that are to either go through the instruction set of the processor (assuming that you want to do this in assembly) and figure out the puzzle yourself, or to find a white paper written by someone trying to sell you on using that processor. -- www.wescottdesign.com
On 8/9/11 1:10 PM, Tim Wescott wrote:
> On Tue, 09 Aug 2011 08:56:34 -0500, mountaineer wrote: >
...
>> What you mean is that some operation can be performed in parallel when >> calculating one sample. I wonder whether there is a way to calculate >> several sample in parallel. > > The only two ways I know to do that are to either go through the > instruction set of the processor (assuming that you want to do this in > assembly) and figure out the puzzle yourself, or to find a white paper > written by someone trying to sell you on using that processor. >
i think that the OP's basic issue (for which i have no answer) is that a recursive alg that depends on an output or state from the immediately previous sample that's for a *single* channel cannot be coded in SIMD since the parallel operation depends on the result of the other concurrent operation. isn't that it? -- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."