DSPRelated.com
Forums

Round off spectrum spikes in 1/2 LSB rounding when performing fixed point filtering

Started by Zhi Ping Ang March 15, 2011
Randy Yates wrote:
> On 03/15/2011 09:55 PM, Zhi Ping Ang wrote: > > I have a bank of 8th order digital bandpass filters with center > > frequencies between 100 Hz to 2.3 kHz at a sampling frequency of 8 > > kHz. Each of these is implemented by cascading 4 biquad filters in > > direct form I. Fixed point computation is used, which means each > > product term is rounded to a fixed word length before being > > accumulated to give the output value of the filter. Filter > > coefficients are also stored using the same word length. > > > > I experimented using truncation rounding and the usual rounding up/ > > down. The former incurs a maximum error of 1 LSB whereas the latter > > incurs at most 1/2 LSB error. Here comes the problem. When I analysed > > the magnitude spectrum of the filter which uses 1/2 LSB rounding > > (http://dhost.info/angzhiping/files/round.jpg), it has more spikes > > than the one which uses truncation rounding (http://dhost.info/ > > angzhiping/files/truncate.jpg). Although I only illustrate a word > > length of 18-bits, the spike locations are consistent and grows as the > > word length precision gradually decreases. > > > > I would like to know the origin of these spikes. Thank you! > > Hi Zhi, > > As Robert said, you shouldn't be rounding before accumulating. I > suggest you fix that problem first, then take another look at the > responses.
Modified such that the partial products are evaluated and accumulated to the full precision before rounding. Indeed the amount of spikes is reduced, but they did not go away. For the benefit of comp.dsp you may access the entire set of spectral plots which compares between the two methods of computations across 3 methods of rounding (truncation, normal rounding and convergent rounding) across varying fractional precision between 10 to 31 bits at: http://dhost.info/angzhiping/files/spectrum/. The file format is as follows: <method>_ch5_fl100_fs8000_i2_<fractional precision>_<rounding mode>.pdf. <method>: M1 - premature rounding of partial products, M4 - full precision product evaluation and accumulation before rounding. <fractional precision>: Number of precision bits used in the fractional field. <rounding mode>: r0 - truncation, r1 - normal rounding, r2 - convergent rounding, where ties are rounded to nearest even.
> -- > Randy Yates Digital Signal Labs > 919-577-9882 http://www.digitalsignallabs.com > yates@digitalsignallabs.com
Vladimir Vassilevsky wrote:
> Zhi Ping Ang wrote: > > > I have a bank of 8th order digital bandpass filters with center > > frequencies between 100 Hz to 2.3 kHz at a sampling frequency of 8 > > kHz. Each of these is implemented by cascading 4 biquad filters in > > direct form I. Fixed point computation is used, which means each > > product term is rounded to a fixed word length before being > > accumulated to give the output value of the filter. Filter > > coefficients are also stored using the same word length. > > > > I experimented using truncation rounding and the usual rounding up/ > > down. The former incurs a maximum error of 1 LSB whereas the latter > > incurs at most 1/2 LSB error. Here comes the problem. When I analysed > > the magnitude spectrum of the filter which uses 1/2 LSB rounding > > (http://dhost.info/angzhiping/files/round.jpg), it has more spikes > > than the one which uses truncation rounding (http://dhost.info/ > > angzhiping/files/truncate.jpg). Although I only illustrate a word > > length of 18-bits, the spike locations are consistent and grows as the > > word length precision gradually decreases. > > > > I would like to know the origin of these spikes. Thank you! > > Perhaps, you are observing the artifacts of the limit cycles in the > filters. Using rounding instead of truncation increases the limit cycle > behavior. Think of rounding/truncation as if it is a center notch step > function; rounding has more "gain". >
If so, is there an analytical method to deduce the locations and magnitude of such spikes?
> > Vladimir Vassilevsky > DSP and Mixed Signal Design Consultant > http://www.abvolt.com
Impoliticus wrote:
> This may or may not be applicable. What sort of rounding are you using in > the case of a tie (i.e. remainder = +/- 0.5)? If a remainder of +0.5 is > rounded up AND a remainder of -0.5 is rounded up, you get a DC bias -- this > would show up as a spectral line. Unbiased rounding would round +0.5 up > and -0.5 down, i.e. 12.5 rounds to 13 and -12.5 rounds to -13.
I assuming you are pointing to the use of convergent rounding which takes into account of this DC biasing by executing a banker's rounding. Convergent rounding also creates spikes, in fact, more spiker than truncation rounding. You may refer to two plots which uses 18-bits word precision: truncation - http://dhost.info/angzhiping/files/spectrum/M4_ch5_fl100_fs8000_i2_f16_r0.pdf, convergent - http://dhost.info/angzhiping/files/spectrum/M4_ch5_fl100_fs8000_i2_f16_r2.pdf.

Zhi Ping Ang wrote:
> Vladimir Vassilevsky wrote: > >>Zhi Ping Ang wrote: >> >> >>>I have a bank of 8th order digital bandpass filters with center >>>frequencies between 100 Hz to 2.3 kHz at a sampling frequency of 8 >>>kHz. Each of these is implemented by cascading 4 biquad filters in >>>direct form I. Fixed point computation is used, which means each >>>product term is rounded to a fixed word length before being >>>accumulated to give the output value of the filter. Filter >>>coefficients are also stored using the same word length. >>> >>>I experimented using truncation rounding and the usual rounding up/ >>>down. The former incurs a maximum error of 1 LSB whereas the latter >>>incurs at most 1/2 LSB error. Here comes the problem. When I analysed >>>the magnitude spectrum of the filter which uses 1/2 LSB rounding >>>(http://dhost.info/angzhiping/files/round.jpg), it has more spikes >>>than the one which uses truncation rounding (http://dhost.info/ >>>angzhiping/files/truncate.jpg). Although I only illustrate a word >>>length of 18-bits, the spike locations are consistent and grows as the >>>word length precision gradually decreases. >>> >>>I would like to know the origin of these spikes. Thank you! >> >>Perhaps, you are observing the artifacts of the limit cycles in the >>filters. Using rounding instead of truncation increases the limit cycle >>behavior. Think of rounding/truncation as if it is a center notch step >>function; rounding has more "gain". >>
> If so, is there an analytical method to deduce the locations and > magnitude of such spikes?
There is analytical approach; however simulation is more practical. A cheap method to combat limit cycles in IIR is introduce a center notch nonlinearity or (better) noise gate into a filter. Vladimir Vassilevsky DSP and Mixed Signal Design Consultant http://www.abvolt.com
On Mar 20, 11:26&#4294967295;pm, Vladimir Vassilevsky <nos...@nowhere.com> wrote:
> > A cheap method to combat limit cycles in IIR is introduce a center notch > nonlinearity or (better) noise gate into a filter.
Vlad, i wonder if center notch (do you mean a dead zone wider than 2 LSB or one of exactly 2 LSB width that you get with the "round toward zero" mode?) will be cheaper than fraction saving. maybe with IEEE floating point (which is sign-magnitude), but with anything 2's complement (and i think nearly any fixed-point system would be 2's comp), you would have to test the sign and round up or down depending on which. fraction saving is cheap, it stops any limit cycle that gets stuck on non-zero DC. and a gate is sorta like a wide center notch, at least if it acts instantaneously. otherwise it's even more expensive. r b-j

robert bristow-johnson wrote:

> On Mar 20, 11:26 pm, Vladimir Vassilevsky <nos...@nowhere.com> wrote: > >>A cheap method to combat limit cycles in IIR is introduce a center notch >>nonlinearity or (better) noise gate into a filter. > > > Vlad, i wonder if center notch (do you mean a dead zone wider than 2 > LSB or one of exactly 2 LSB width that you get with the "round toward > zero" mode?) will be cheaper than fraction saving.
1. That depends on either it is done in the hardware or in the software. 2. One notch/gate could be used to cut off the whole chain of biquads. 3. Noise shaping does improve the situation but it does not kill the limit cycles completely. maybe with IEEE
> floating point (which is sign-magnitude), but with anything 2's > complement (and i think nearly any fixed-point system would be 2's > comp), you would have to test the sign and round up or down depending > on which. fraction saving is cheap, it stops any limit cycle that > gets stuck on non-zero DC. > > and a gate is sorta like a wide center notch, at least if it acts > instantaneously. otherwise it's even more expensive.
Vladimir Vassilevsky DSP and Mixed Signal Design Consultant http://www.abvolt.com
On 03/20/2011 09:00 PM, Zhi Ping Ang wrote:
> [...] > Modified such that the partial products are evaluated and accumulated > to the full precision before rounding. Indeed the amount of spikes is > reduced, but they did not go away.
Hi Zhi, In that case, I would dither the accumulated values before rounding. That should randomize and/or kill any limit-cycling. -- Randy Yates Digital Signal Labs 919-577-9882 http://www.digitalsignallabs.com yates@digitalsignallabs.com
On Mar 17, 12:36&#4294967295;am, robert bristow-johnson
<r...@audioimagination.com> wrote:

*Snip*

> now the accumulator is *not* cleared (it has y1 left in it, but we had > subtracted 1 from a1 that causes y1 to be subtracted so it's the old > "add and subtract the same quantity" trick). > > but what happened in the low-order word of the accumulator? &#4294967295;the bits > that were dropped by the truncation are added right back again. &#4294967295;we > saved the fraction and added that back in the accumulator for the > following sample. &#4294967295;that will give you infinite S/N at DC (it puts a > zero at z=1 in the quantization-noise-to-output transfer function). > > quite effective and simple. > > the above is optimal only for single-section biquad filters. &#4294967295;for your > multi-section biquads, somehow you have to structure it so that y1 and > y2 of the current section are used for x1 and x2 of the following > section. > > r b-j
I'm using these in a cascade manner. Why is it not enough to just use y1 as an input to the next stage? ZP
On Apr 4, 2:52&#4294967295;am, Zhi Ping Ang <angzhip...@gmail.com> wrote:
> On Mar 17, 12:36&#4294967295;am, robert bristow-johnson > > <r...@audioimagination.com> wrote: > > *Snip* > > > > > now the accumulator is *not* cleared (it has y1 left in it, but we had > > subtracted 1 from a1 that causes y1 to be subtracted so it's the old > > "add and subtract the same quantity" trick). > > > but what happened in the low-order word of the accumulator? &#4294967295;the bits > > that were dropped by the truncation are added right back again. &#4294967295;we > > saved the fraction and added that back in the accumulator for the > > following sample. &#4294967295;that will give you infinite S/N at DC (it puts a > > zero at z=1 in the quantization-noise-to-output transfer function). > > > quite effective and simple. > > > the above is optimal only for single-section biquad filters. &#4294967295;for your > > multi-section biquads, somehow you have to structure it so that y1 and > > y2 of the current section are used for x1 and x2 of the following > > section. > > I'm using these in a cascade manner. Why is it not enough to just use > y1 as an input to the next stage?
it's sufficient, but wasteful. for a single biquad in Direct Form 1, you have: y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2] it's non-canonical because it's 2nd order, but you have more than two states. you have for states: x[n-1] x[n-2] y[n-1] y[n-2] for each i-th cascaded stage you have: y_i[n] = b0*x_i[n] + b1*x_i[n-1] + b2*x_i[n-2] - a1*y_i[n-1] - a2*y_i[n-2] where the "_i" notation means a subscript of "i". now we defined the input to the (i+1)th stage to be the output of the i-th stage: x_(i+1)[n] = y_i[n] but that means that x_(i+1)[n-1] = y_i[n-1] x_(i+1)[n-2] = y_i[n-2] so the two input states of the (i+1)th stage are exactly the same numbers as the two output states of the i-th stage. if you have I stages (and 2*I order filter) in cascade, using Direct Form 1 will not require 4*I states as one might naively conclude, it requires 2*I + 2 states, only 2 more states for the whole thing than a canonical form (DF2) would require. but, if you have a double-wide accumulator, the DF1 has serious advantages over the DF2, namely that of avoidance of clipping from the highly resonant poles of each section. (there are well-known tricks or rules of thumb you should use in grouping the zeros with the poles for each stage so that there is not a nasty gain and clipping in any single stage.) r b-j
On Apr 4, 11:42&#4294967295;am, robert bristow-johnson <r...@audioimagination.com>
wrote:
> On Apr 4, 2:52&#4294967295;am, Zhi Ping Ang <angzhip...@gmail.com> wrote: > > > > > > > > > > > On Mar 17, 12:36&#4294967295;am, robert bristow-johnson > > > <r...@audioimagination.com> wrote: > > > *Snip* > > > > now the accumulator is *not* cleared (it has y1 left in it, but we had > > > subtracted 1 from a1 that causes y1 to be subtracted so it's the old > > > "add and subtract the same quantity" trick). > > > > but what happened in the low-order word of the accumulator? &#4294967295;the bits > > > that were dropped by the truncation are added right back again. &#4294967295;we > > > saved the fraction and added that back in the accumulator for the > > > following sample. &#4294967295;that will give you infinite S/N at DC (it puts a > > > zero at z=1 in the quantization-noise-to-output transfer function). > > > > quite effective and simple. > > > > the above is optimal only for single-section biquad filters. &#4294967295;for your > > > multi-section biquads, somehow you have to structure it so that y1 and > > > y2 of the current section are used for x1 and x2 of the following > > > section. > > > I'm using these in a cascade manner. Why is it not enough to just use > > y1 as an input to the next stage? > > it's sufficient, but wasteful. > > for a single biquad in Direct Form 1, you have: > > &#4294967295; &#4294967295;y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2] > > it's non-canonical because it's 2nd order, but you have more than two > states. &#4294967295;you have for states: > > &#4294967295; &#4294967295;x[n-1] > &#4294967295; &#4294967295;x[n-2] > &#4294967295; &#4294967295;y[n-1] > &#4294967295; &#4294967295;y[n-2] > > for each i-th cascaded stage you have: > > &#4294967295; &#4294967295;y_i[n] = b0*x_i[n] + b1*x_i[n-1] + b2*x_i[n-2] > &#4294967295; &#4294967295; &#4294967295; &#4294967295; &#4294967295; &#4294967295; &#4294967295; &#4294967295; &#4294967295; &#4294967295; &#4294967295; - a1*y_i[n-1] - a2*y_i[n-2] > > where the "_i" notation means a subscript of "i". &#4294967295;now we defined the > input to the (i+1)th stage to be the output of the i-th stage: > > &#4294967295; &#4294967295;x_(i+1)[n] &#4294967295;= y_i[n] > > but that means that > > &#4294967295; &#4294967295;x_(i+1)[n-1] &#4294967295;= y_i[n-1] > &#4294967295; &#4294967295;x_(i+1)[n-2] &#4294967295;= y_i[n-2] > > so the two input states of the (i+1)th stage are exactly the same > numbers as the two output states of the i-th stage. > > if you have I stages (and 2*I order filter) in cascade, using Direct > Form 1 will not require 4*I states as one might naively conclude, it > requires 2*I + 2 states, only 2 more states for the whole thing than a > canonical form (DF2) would require. &#4294967295;but, if you have a double-wide > accumulator, the DF1 has serious advantages over the DF2, namely that > of avoidance of clipping from the highly resonant poles of each > section. &#4294967295;(there are well-known tricks or rules of thumb you should > use in grouping the zeros with the poles for each stage so that there > is not a nasty gain and clipping in any single stage.) > > r b-j
Got it, thanks. Talking about clipping, I have taken the approach of dividing the numerator coefficients (i.e. the coefficients of Y(z) of H(z) = Y(z)/ X(z)) of the original biquad filter coefficients, meant to be implemented using floating point arithmetic, by the absolute sum of the impulse response, i.e. sum(|h[n]|). I read this somewhere that by doing so, given the input data x[n] is within +/-1, the output is guaranteed to be +/- 1 too, hence avoiding the problem of overflowing that occurs in fixed point implementations. But the disadvantage is that the filter's spectrum is shifted down, hence the output signal is somehow attenuated. For a Butterworth biquad, the DC gain is approximately 0.9. So although the problem of overflow has been solved, I'm losing precision if I cascaded a series of such fixed point filters. Short of using floating point arithmetic, is there a way to avoid this problem?