DSPRelated.com
Forums

IIR biquad implementation in fixed point / integer

Started by Eugen April 10, 2014
Hello,

I tried implementing a LPF biquad (that works perfectly in floating
point) with: a = [1.00000, -1.90887, 0.91129], b = [6.1150e-04,
1.2230e-03, 6.1150e-04] in fixed point / integer, with 16 bit
coefficients, 10 bit input signal and a 32 bit accumulator. I suspect
the coefficient resolution is not enough.

I tried:
- the direct form 1 and direct form 2 transposed structures, but the
output is zero if the input signal is not very high -- because of the
small b coefficients, the signal is attenuated very much in the FIR
portion of the filter and is lost;

int biquad_ab(d, a0, a1, a2, b0, b1, b2, a_shift, b_shift)  {
		signed long acc; 
		acc = b0 * d->db[0] + b1 * d->db[1] + b2 * d->db[2]; 
		acc = acc >> b_shift;
		acc = a0 * acc - a1 * d->da[0] - a2 * d->da[1]; 
		acc = acc >> a_shift;
		d->da[1] = d->da[0]; 
		d->da[0] = acc; 
		d->db[2] = d->db[1]; 
		d->db[1] = d->db[0];
		return acc;
}


- the direct form 2, but the filter works OK only with a very small
input signal, otherwise it overflows.

What structure should I use for this biquad? Can it be implemented in
this way considering the small b values, or I should use a 64 bit
accumulator and maybe 32 bit coefficients?

Thanks

On Thu, 10 Apr 2014 21:45:13 +0000, Eugen wrote:

> Hello, > > I tried implementing a LPF biquad (that works perfectly in floating > point) with: a = [1.00000, -1.90887, 0.91129], b = [6.1150e-04, > 1.2230e-03, 6.1150e-04] in fixed point / integer, with 16 bit > coefficients, 10 bit input signal and a 32 bit accumulator. I suspect > the coefficient resolution is not enough. > > I tried: > - the direct form 1 and direct form 2 transposed structures, but the > output is zero if the input signal is not very high -- because of the > small b coefficients, the signal is attenuated very much in the FIR > portion of the filter and is lost; > > int biquad_ab(d, a0, a1, a2, b0, b1, b2, a_shift, b_shift) { > signed long acc; > acc = b0 * d->db[0] + b1 * d->db[1] + b2 * d->db[2]; > acc = acc >> b_shift; > acc = a0 * acc - a1 * d->da[0] - a2 * d->da[1]; > acc = acc >> a_shift; > d->da[1] = d->da[0]; > d->da[0] = acc; > d->db[2] = d->db[1]; > d->db[1] = d->db[0]; > return acc; > } > > > - the direct form 2, but the filter works OK only with a very small > input signal, otherwise it overflows. > > What structure should I use for this biquad? Can it be implemented in > this way considering the small b values, or I should use a 64 bit > accumulator and maybe 32 bit coefficients? > > Thanks
Your states need to be 32 bits as well, not just the accumulator. I'm pretty sure that if you do that DF1 or DF2 transposed should work, but you need to be careful of how you scale things going from input to state. It is very helpful in cases like this to draw a block diagram of the filter and extract a transfer function from input to each state (this is easy if you represent the filter in state-space form). Do a Bode plot of each transfer function and you'll see how big (or small) the magnitude of the signal in each state is; that will help to guide you in scaling things correctly. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com
On 4/10/14 6:27 PM, Tim Wescott wrote:
> On Thu, 10 Apr 2014 21:45:13 +0000, Eugen wrote: > >> Hello, >> >> I tried implementing a LPF biquad (that works perfectly in floating >> point) with: a = [1.00000, -1.90887, 0.91129], b = [6.1150e-04, >> 1.2230e-03, 6.1150e-04] in fixed point / integer, with 16 bit >> coefficients, 10 bit input signal and a 32 bit accumulator. I suspect >> the coefficient resolution is not enough. >>
,,,
> Your states need to be 32 bits as well, not just the accumulator.
not necessarily. if Eugen is willing to accept the 96 dB dynamic range (that is S/N + headroom) from truncating the accumulator to be y[n], then you can do it with a simple Direct Form 1. since the a1 feedback coef is as big as 2 and a2 is as big as 1, then the accumulator should be shifted right 1 bit after adding (2^15)*a2*y[n-2] and then shifted right 14 bits after adding (2^14)*a1*y[n-1] are added.
> I'm > pretty sure that if you do that DF1 or DF2 transposed should work, but > you need to be careful of how you scale things going from input to state. > > It is very helpful in cases like this to draw a block diagram of the > filter and extract a transfer function from input to each state (this is > easy if you represent the filter in state-space form). Do a Bode plot of > each transfer function and you'll see how big (or small) the magnitude of > the signal in each state is; that will help to guide you in scaling > things correctly.
i got some integer C code for fixed-point DF1 biquad somewhere. its got this "fraction saving", essentially 1st-order noise shaping with a zero at DC, in it also. if you're careful with scaling, you can do it. -- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
On 4/10/14 8:21 PM, robert bristow-johnson wrote:
> > > i got some integer C code for fixed-point DF1 biquad somewhere. its got > this "fraction saving", essentially 1st-order noise shaping with a zero > at DC, in it also. if you're careful with scaling, you can do it. >
can't find that code, but it would look a lot like this: void fixed_DF1_biquad ( int num_samples, int16 *input, int16 *output, int16 *states, int16 *coefs ) { int32 a2 = (int32)(*coefs++); int32 a1 = (int32)(*coefs++); int32 b2 = (int32)(*coefs++); int32 b1 = (int32)(*coefs++); int32 b0 = (int32)(*coefs); int32 x1 = (int32)(*states++); int32 x2 = (int32)(*states++); int32 y1 = (int32)(*states++); int32 y2 = (int32)(*states++); int32 acc = (int32)(*states); acc += y1<<14; a1 -= 0x4000; while (num_samples-- > 0) { acc += a2*y2; acc += a1*y1; y2 = y1; acc += b2*x2; acc += b1*x1; x2 = x1; x1 = (int32)(*(input++)); acc += b0*x1; if (acc > 0x1FFFFFFF) acc = 0x1FFFFFFF; if (acc < -0x20000000) acc = -0x20000000; y1 = acc>>14; *(output++) = (int16)y1; } acc -= y1<<14; *(states--) = (int16)acc; *(states--) = (int16)y2; *(states--) = (int16)y1; *(states--) = (int16)x2; *(states) = (int16)x1; } all coefficients are scaled up by a factor of 16384 or 2^14. because C doesn't seem to know how to multiply two int16s into an int32 without first promoting one of the int16 into int32, i just made the states and the coefs be int32, but the upper 16 bits are only a sign extension. if you were doing this in assembly, you would not bother to do that. one of my complaints about C (and C++) is that the assumed type should increase in width when multiplying. e.g. multiplying two int16 types should result, without casting, in an int32 but C does not do that. -- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
On Thu, 10 Apr 2014 20:21:22 -0400, robert bristow-johnson wrote:

> On 4/10/14 6:27 PM, Tim Wescott wrote: >> On Thu, 10 Apr 2014 21:45:13 +0000, Eugen wrote: >> >>> Hello, >>> >>> I tried implementing a LPF biquad (that works perfectly in floating >>> point) with: a = [1.00000, -1.90887, 0.91129], b = [6.1150e-04, >>> 1.2230e-03, 6.1150e-04] in fixed point / integer, with 16 bit >>> coefficients, 10 bit input signal and a 32 bit accumulator. I suspect >>> the coefficient resolution is not enough. >>> > ,,, >> Your states need to be 32 bits as well, not just the accumulator. > > not necessarily. if Eugen is willing to accept the 96 dB dynamic range > (that is S/N + headroom) from truncating the accumulator to be y[n], > then you can do it with a simple Direct Form 1.
From his description, though, he can't. "One form truncates, the other form limits" sounds like he's investigated his dynamic range, however naively, and has found that he needs more. Even if his problem is that he doesn't thoroughly understand the issues, in this degenerate age of easy bits, he may find that a solution involving throwing more processor resources at it is superior to a technically better solution of more tightly controlling his gains to place the available dynamic range right where it needs to be. -- Tim Wescott Wescott Design Services http://www.wescottdesign.com
Tim Wescott wrote:

> On Thu, 10 Apr 2014 20:21:22 -0400, robert bristow-johnson wrote: > >> On 4/10/14 6:27 PM, Tim Wescott wrote: >>> On Thu, 10 Apr 2014 21:45:13 +0000, Eugen wrote: >>> >>>> Hello, >>>> >>>> I tried implementing a LPF biquad (that works perfectly in floating >>>> point) with: a = [1.00000, -1.90887, 0.91129], b = [6.1150e-04, >>>> 1.2230e-03, 6.1150e-04] in fixed point / integer, with 16 bit >>>> coefficients, 10 bit input signal and a 32 bit accumulator. I suspect >>>> the coefficient resolution is not enough. >>>> >> ,,, >>> Your states need to be 32 bits as well, not just the accumulator. >> >> not necessarily. if Eugen is willing to accept the 96 dB dynamic range >> (that is S/N + headroom) from truncating the accumulator to be y[n], >> then you can do it with a simple Direct Form 1. > > From his description, though, he can't. "One form truncates, the other > form limits" sounds like he's investigated his dynamic range, however > naively, and has found that he needs more. > > Even if his problem is that he doesn't thoroughly understand the issues, > in this degenerate age of easy bits, he may find that a solution > involving throwing more processor resources at it is superior to a > technically better solution of more tightly controlling his gains to > place the available dynamic range right where it needs to be. >
This filter is a simple LPF and, even if the output dynamic range could be increased after filtering, I don't need this. The input is 10 bits and I need a 10 bit output, I expected that for a continous sequence of x = 1 input samples, I would eventually obtain y = 1. As the output states are 0, the a terms contributions are always 0, so I get no output. I got overflow only with DF2 -- I used DF2 because the a terms come first and I hoped this way the signal won't be lost, I know DF2 is not recommended due to overflow. I can't use significant resources, it should run on a dsPIC30 microcontroller and for this reason I prefered the DF2 transposed to minimize memory.
robert bristow-johnson wrote:

> On 4/10/14 8:21 PM, robert bristow-johnson wrote: >> >> >> i got some integer C code for fixed-point DF1 biquad somewhere. its got >> this "fraction saving", essentially 1st-order noise shaping with a zero >> at DC, in it also. if you're careful with scaling, you can do it. >> > > can't find that code, but it would look a lot like this: > > > void fixed_DF1_biquad > ( > int num_samples, > int16 *input, > int16 *output, > int16 *states, > int16 *coefs > ) > { > int32 a2 = (int32)(*coefs++); > int32 a1 = (int32)(*coefs++); > int32 b2 = (int32)(*coefs++); > int32 b1 = (int32)(*coefs++); > int32 b0 = (int32)(*coefs); > > int32 x1 = (int32)(*states++); > int32 x2 = (int32)(*states++); > int32 y1 = (int32)(*states++); > int32 y2 = (int32)(*states++); > int32 acc = (int32)(*states); > acc += y1<<14; > > a1 -= 0x4000; > > while (num_samples-- > 0) > { > acc += a2*y2; > acc += a1*y1; > y2 = y1; > acc += b2*x2; > acc += b1*x1; > x2 = x1; > x1 = (int32)(*(input++)); > acc += b0*x1; > if (acc > 0x1FFFFFFF) acc = 0x1FFFFFFF; > if (acc < -0x20000000) acc = -0x20000000; > y1 = acc>>14; > *(output++) = (int16)y1; > } > > acc -= y1<<14; > *(states--) = (int16)acc; > *(states--) = (int16)y2; > *(states--) = (int16)y1; > *(states--) = (int16)x2; > *(states) = (int16)x1; > } > > > > all coefficients are scaled up by a factor of 16384 or 2^14. > > because C doesn't seem to know how to multiply two int16s into an int32 > without first promoting one of the int16 into int32, i just made the > states and the coefs be int32, but the upper 16 bits are only a sign > extension. if you were doing this in assembly, you would not bother to > do that. > > one of my complaints about C (and C++) is that the assumed type should > increase in width when multiplying. e.g. multiplying two int16 types > should result, without casting, in an int32 but C does not do that. > >
Thanks for the code. I don't understand why it's necessary to subtract 0x4000 from a1 and to saturate the accumulator -- if saturating the accumulator modifies the output signal. My initial code was defined as a macro that gets redefined with constants ai, bi for each biquad, that's why the function definition looks strange, without data types. I forgot to mention that the a and b coefficients are for a biquad that is part of a 4th order Butterworth with fc = 2 Hz, fs = 250 Hz. Maybe a filter with a ratio fs/fc over 100 is too extreme and hard to implement in fixed point.
On Fri, 11 Apr 2014 18:35:32 +0000, Eugen wrote:

> robert bristow-johnson wrote: > >> On 4/10/14 8:21 PM, robert bristow-johnson wrote: >>> >>> >>> i got some integer C code for fixed-point DF1 biquad somewhere. its >>> got this "fraction saving", essentially 1st-order noise shaping with a >>> zero at DC, in it also. if you're careful with scaling, you can do it. >>> >>> >> can't find that code, but it would look a lot like this: >> >> >> void fixed_DF1_biquad ( >> int num_samples, >> int16 *input, >> int16 *output, >> int16 *states, >> int16 *coefs >> ) >> { >> int32 a2 = (int32)(*coefs++); int32 a1 = (int32)(*coefs++); int32 >> b2 = (int32)(*coefs++); int32 b1 = (int32)(*coefs++); int32 b0 = >> (int32)(*coefs); >> >> int32 x1 = (int32)(*states++); >> int32 x2 = (int32)(*states++); >> int32 y1 = (int32)(*states++); >> int32 y2 = (int32)(*states++); >> int32 acc = (int32)(*states); acc += y1<<14; >> >> a1 -= 0x4000; >> >> while (num_samples-- > 0) >> { >> acc += a2*y2; >> acc += a1*y1; >> y2 = y1; >> acc += b2*x2; >> acc += b1*x1; >> x2 = x1; >> x1 = (int32)(*(input++)); >> acc += b0*x1; >> if (acc > 0x1FFFFFFF) acc = 0x1FFFFFFF; >> if (acc < -0x20000000) acc = -0x20000000; >> y1 = acc>>14; *(output++) = (int16)y1; >> } >> >> acc -= y1<<14; >> *(states--) = (int16)acc; *(states--) = (int16)y2; *(states--) = >> (int16)y1; *(states--) = (int16)x2; *(states) = (int16)x1; >> } >> >> >> >> all coefficients are scaled up by a factor of 16384 or 2^14. >> >> because C doesn't seem to know how to multiply two int16s into an int32 >> without first promoting one of the int16 into int32, i just made the >> states and the coefs be int32, but the upper 16 bits are only a sign >> extension. if you were doing this in assembly, you would not bother to >> do that. >> >> one of my complaints about C (and C++) is that the assumed type should >> increase in width when multiplying. e.g. multiplying two int16 types >> should result, without casting, in an int32 but C does not do that. >> >> >> > Thanks for the code. > I don't understand why it's necessary to subtract 0x4000 from a1 and to > saturate the accumulator -- if saturating the accumulator modifies the > output signal. > > My initial code was defined as a macro that gets redefined with > constants ai, bi for each biquad, that's why the function definition > looks strange, without data types. > > I forgot to mention that the a and b coefficients are for a biquad that > is part of a 4th order Butterworth with fc = 2 Hz, fs = 250 Hz. Maybe a > filter with a ratio fs/fc over 100 is too extreme and hard to implement > in fixed point.
It's not hard to implement, but the larger the fs/fc, the more precision you need in your states. For a 2nd-order filter you need roughly 2 * log2(fs/fc) more bits in your states than in your input to contain the numbers. If eight more bytes are really too expensive, then pre-filter the thing with a simple boxcar FIR filter, sub-sample, and then do your IIR filters with a smaller sampling rate. Of course, you may need more than eight bytes of RAM for your FIR filter... -- Tim Wescott Wescott Design Services http://www.wescottdesign.com
On 4/11/14 3:57 PM, Tim Wescott wrote:
> On Fri, 11 Apr 2014 18:35:32 +0000, Eugen wrote: > >> robert bristow-johnson wrote: >> >>> On 4/10/14 8:21 PM, robert bristow-johnson wrote: >>>> >>>> >>>> i got some integer C code for fixed-point DF1 biquad somewhere. its >>>> got this "fraction saving", essentially 1st-order noise shaping with a >>>> zero at DC, in it also. if you're careful with scaling, you can do it. >>>> >>>> >>> can't find that code, but it would look a lot like this: >>> >>> >>> void fixed_DF1_biquad ( >>> int num_samples, >>> int16 *input, >>> int16 *output, >>> int16 *states, >>> int16 *coefs >>> ) >>> { >>> int32 a2 = (int32)(*coefs++); int32 a1 = (int32)(*coefs++); int32 >>> b2 = (int32)(*coefs++); int32 b1 = (int32)(*coefs++); int32 b0 = >>> (int32)(*coefs); >>> >>> int32 x1 = (int32)(*states++); >>> int32 x2 = (int32)(*states++); >>> int32 y1 = (int32)(*states++); >>> int32 y2 = (int32)(*states++); >>> int32 acc = (int32)(*states); acc += y1<<14; >>> >>> a1 -= 0x4000; >>> >>> while (num_samples--> 0) >>> { >>> acc += a2*y2; >>> acc += a1*y1; >>> y2 = y1; >>> acc += b2*x2; >>> acc += b1*x1; >>> x2 = x1; >>> x1 = (int32)(*(input++)); >>> acc += b0*x1; >>> if (acc> 0x1FFFFFFF) acc = 0x1FFFFFFF; >>> if (acc< -0x20000000) acc = -0x20000000; >>> y1 = acc>>14; *(output++) = (int16)y1; >>> } >>> >>> acc -= y1<<14; >>> *(states--) = (int16)acc; *(states--) = (int16)y2; *(states--) = >>> (int16)y1; *(states--) = (int16)x2; *(states) = (int16)x1; >>> } >>> >>> >>> >>> all coefficients are scaled up by a factor of 16384 or 2^14. >>> >>> because C doesn't seem to know how to multiply two int16s into an int32 >>> without first promoting one of the int16 into int32, i just made the >>> states and the coefs be int32, but the upper 16 bits are only a sign >>> extension. if you were doing this in assembly, you would not bother to >>> do that. >>> >>> one of my complaints about C (and C++) is that the assumed type should >>> increase in width when multiplying. e.g. multiplying two int16 types >>> should result, without casting, in an int32 but C does not do that. >>> >>> >>> >> Thanks for the code. >> I don't understand why it's necessary to subtract 0x4000 from a1 and to >> saturate the accumulator -- if saturating the accumulator modifies the >> output signal.
i am subtracting 2^14 times 1 from the integer value of a1 which is the same as subtracting 1 from the naturally scaled a1 coefficient. this is because the accumulator acc is not initialized to zero or even zero plus the noise-shaping bits. actually, it's not such a good thing to do for just 1 integer bit (plus sign bit), it needs another bit if -2 < a1 < -1.5 . the noise-shaping trick here is to always quantize by rounding down and just truncating the bits. but we add those truncated bits back in for the next sample, that puts a zero right on DC, so even though we're always rounding down, it has infinite S/N at DC. that's useful to put a stop to a limit-cycle problem where zero is going into the filter, it gets stuck on a nonzero value because it rounds up after a small decay. i've seen this at -70 dBFS (with 24 bit) and the LED level monitor was just getting stuck after (digitally) muting the channel. could be worse with 16 bit. anyway, instead of just keeping the bottom bits and masking out the top 18 bits, i just leave the bits in the top 18 and subtract them out when a1 and y1 are MAC'd.
>> >> My initial code was defined as a macro that gets redefined with >> constants ai, bi for each biquad, that's why the function definition >> looks strange, without data types. >> >> I forgot to mention that the a and b coefficients are for a biquad that >> is part of a 4th order Butterworth with fc = 2 Hz, fs = 250 Hz. Maybe a >> filter with a ratio fs/fc over 100 is too extreme and hard to implement >> in fixed point.
on thing, your numerator coefs are awful small, me thinks. you might want to scale all 3 of them up a little and compensate the gain change somewhere else in the signal chain.
> > It's not hard to implement, but the larger the fs/fc, the more precision > you need in your states. For a 2nd-order filter you need roughly > 2 * log2(fs/fc) more bits in your states than in your input to contain > the numbers.
you can scale it so that the numbers are good for 16-bit. that's an identical adjustment of scale to b0, b1, b2. for a DF1, the states are scaled the same as the input and output signals. i really think it's about the coefficients. if fc/fs gets too small, your poles are dancing close to z=1 and you might not have enough bits to put your fc and Q exactly where you want them to be.
> > If eight more bytes are really too expensive, then pre-filter the thing > with a simple boxcar FIR filter, sub-sample, and then do your IIR filters > with a smaller sampling rate. > > Of course, you may need more than eight bytes of RAM for your FIR > filter... >
-- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
>Tim Wescott wrote: > >> On Thu, 10 Apr 2014 20:21:22 -0400, robert bristow-johnson wrote: >> >>> On 4/10/14 6:27 PM, Tim Wescott wrote: >>>> On Thu, 10 Apr 2014 21:45:13 +0000, Eugen wrote: >>>> >>>>> Hello, >>>>> >>>>> I tried implementing a LPF biquad (that works perfectly in floating >>>>> point) with: a = [1.00000, -1.90887, 0.91129], b = [6.1150e-04, >>>>> 1.2230e-03, 6.1150e-04] in fixed point / integer, with 16 bit >>>>> coefficients, 10 bit input signal and a 32 bit accumulator. I
suspect
>>>>> the coefficient resolution is not enough. >>>>> >>> ,,, >>>> Your states need to be 32 bits as well, not just the accumulator. >>> >>> not necessarily. if Eugen is willing to accept the 96 dB dynamic
range
>>> (that is S/N + headroom) from truncating the accumulator to be y[n], >>> then you can do it with a simple Direct Form 1. >> >> From his description, though, he can't. "One form truncates, the other
>> form limits" sounds like he's investigated his dynamic range, however >> naively, and has found that he needs more. >> >> Even if his problem is that he doesn't thoroughly understand the issues,
>> in this degenerate age of easy bits, he may find that a solution >> involving throwing more processor resources at it is superior to a >> technically better solution of more tightly controlling his gains to >> place the available dynamic range right where it needs to be. >> > >This filter is a simple LPF and, even if the output dynamic range could >be increased after filtering, I don't need this. > >The input is 10 bits and I need a 10 bit output, I expected that for >a continous sequence of x = 1 input samples, I would eventually obtain y >= 1. As the output states are 0, the a terms contributions are always 0, >so I get no output. > >I got overflow only with DF2 -- I used DF2 because the a terms come >first and I hoped this way the signal won't be lost, I know DF2 is not >recommended due to overflow. > >I can't use significant resources, it should run on a dsPIC30 >microcontroller and for this reason I prefered the DF2 transposed to >minimize memory. >
If I look at your numerator coeffs: there aren't many root locations you can achieve in 16bits. Just figure out the coeffs times 32768. If you code the result in hex and there are only a couple of bits left, you have to live with some inaccuracies of the cuttoff frequency. The numerator coeffs are so small, that you have to use 32bits for your state variables. If you do so: for a dsPIC30, there shoiuld be no problem implementing such a filter with second order error feedback (two zero's at DC). Coding it in Assembler, no significant resources are wastet. Gerold, idsa _____________________________ Posted through www.DSPRelated.com