
IIR with fixed-point numbers?

Started by Elektro May 24, 2005
What's the best way to determine the number of bits required in a
fixed-point IIR filter to avoid overflow during calculations?


"Elektro" <blabla@bredband.net> wrote in
news:4293a2ac$0$79465$14726298@news.sunsite.dk: 

> What's the best way to determine the number of bits required in a
> fixed-point IIR filter to avoid overflow during calculations?
The number of bits needed in an IIR filter depends on many factors,
including the coefficients of the filter, the desired performance of the
filter, and the implementation of the filter. In many cases a basic 16 bit
biquad implementation may be fine. In other cases you may want 32 bit or
better math with first order error shaping.

Here are a couple of guidelines:

1. Use cascaded second order sections.
2. Use a Direct Form I implementation (or sometimes Lattice). First order
   error shaping is almost free with DF I as well.
3. If you are trying to create low frequency, high Q filters with a high
   sample rate, you will need very good precision.

In most cases, floating point IIRs are not as good as the right fixed
point implementation. Multiplying two floating point numbers is good, but
adding a large float to a small float is bad.

With fixed point you can take advantage of a large accumulator, which is
usually twice the word width plus guard bits, for example 80 bits in a
SHARC. With floating point the result of a multiply is the same length
and precision as the inputs to the multiplier.

There are some excellent papers published on this topic in the AES
Journal. Mark Allie gave a very good presentation at last year's comp.dsp
conference on this topic.

--
Al Clark
Danville Signal Processing, Inc.
--------------------------------------------------------------------
Purveyors of Fine DSP Hardware and other Cool Stuff
Available at http://www.danvillesignal.com
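[To make guideline 2 concrete, here is a minimal Python sketch, not from
the thread itself; the Q2.30 coefficient format and all names are assumed
for illustration. It shows a Direct Form I biquad with first order error
shaping, sometimes called fraction saving: the fraction that truncation
discards is saved and added back into the accumulator on the next sample.

# Sketch: Direct Form I biquad with first order error shaping
# ("fraction saving"). Python's big integers emulate the wide
# accumulator of a fixed-point DSP; Q2.30 coefficients are assumed.

Q = 30  # coefficient fractional bits: coef_int = round(coef * 2**Q)

def df1_biquad(x, b0, b1, b2, a1, a2):
    """Filter integer samples x; coefficients are Q2.30 integers.
    Implements y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2]
                      - a1*y[n-1] - a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0   # delay line
    err = 0                 # truncation error saved for the next sample
    out = []
    for xn in x:
        # the whole sum stays at full precision, like a wide MAC register
        acc = b0*xn + b1*x1 + b2*x2 - a1*y1 - a2*y2 + err
        yn = acc >> Q              # truncate back to the output width
        err = acc - (yn << Q)      # the fraction just discarded
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out

Because the discarded fraction is re-injected, the truncation noise is
shaped away from DC, which is exactly where low frequency, high Q filters
are most sensitive; in DF I the extra addition is nearly free, as Al says.]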
Thank you



But what I'm after is a way to predict how large the numbers inside the
filter are going to get.



For example, if I send in a 40 Hz sine with an amplitude of 8,000,000, it
may generate amplitudes of 3,000,000,000 at some node in the filter.
That's what I'm after: a way to predict how big these numbers can get in
the worst case.


"Al Clark" <dsp@danvillesignal.com> skrev i meddelandet
news:Xns9660B1FC041E2aclarkdanvillesignal@66.133.129.71...
> "Elektro" <blabla@bredband.net> wrote in > news:4293a2ac$0$79465$14726298@news.sunsite.dk: > > > What's the best way to determine the required number of bits necessary > > in a fixed-point IIR-filter? To avoid overflow during calculations. > > > > > > > > The number of bits needed in an IIR filter depends on many factors > including the coefficients of the filter, the desired performance of the > filter and the implementation of the filter. > > In many cases a basic 16 bit biquad implementation may be fine. In other > cases you may want 32 bit or better math with first order error shaping. > > Here are a couple of guidelines: > > 1. Use cascaded second order sections. > 2. Use a Direct Form I implementation (or sometimes Lattice) First order > error shaping is almost free with this DF I as well. > 3. If you are trying to create low frequency, high Q filters with a high > sample rate, you will need very good precision. > > In most cases, floating point IIRs are not as good as the right fixed > point implementation. Multiplying two floating point numbers is good, but > adding a large float to a small float is bad. > > With fixed point you can take advantage of a large accumulator which is > usually twice the word width plus guard bits, for example 80 bits in a > SHARC. With floating point the result of a multiply is the same length > and precision as the inputs to the multiplier. > > There are some excellent papers published on this topic in the AES > Journal. Mark Allie gave a very good presentation at last years comp.dsp > conference on this topic. > > > > -- > Al Clark > Danville Signal Processing, Inc. > -------------------------------------------------------------------- > Purveyors of Fine DSP Hardware and other Cool Stuff > Available at http://www.danvillesignal.com
Elektro wrote:
> Thank you
>
> But what I'm after is a way to predict how large the numbers inside the
> filter are going to get.
>
> For example, if I send in a 40 Hz sine with an amplitude of 8,000,000,
> it may generate amplitudes of 3,000,000,000 at some node in the filter.
> That's what I'm after: a way to predict how big these numbers can get in
> the worst case.
You can be certain that they will be bigger than is manageable unless you
scale your quantities initially, and probably as the calculation proceeds.
What you propose to do needs techniques that aren't obvious to people who
think of arithmetic in paper-and-pencil terms. Floating point needs only a
small departure from that mindset, but fixed point (scaled integer) is
almost a different realm. It requires logic beyond simple arithmetic, and
benefits from art.

Randy Yates has written an excellent tutorial that lays out the groundwork
and supplies some of the art.

Jerry
--
Engineering is the art of making what you want from things you can get.
Hello again



Now I have found a way to bound the intermediate results in the IIR filter
and so avoid overflow.



The method was in fact simple: compute the impulse response from the
filter's input to the node in question, take the sum of its absolute
values, and multiply that sum by the maximum input amplitude.



Then you know that the signal at that node never exceeds that number.
After that it's easy to calculate the required number of integer bits.



This method was described in the paper 'Determining appropriate precisions
for signals in fixed-point IIR filters'.
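[As a sketch of that calculation in Python - the biquad coefficients and
the truncation length below are made-up examples, and truncating the
impulse response is only safe once it has decayed to a negligible level:

import math

def worst_case_amplitude(b, a, x_max, n=100_000):
    """Bound |signal| at a node by x_max * sum(|h[k]|), where h is the
    impulse response of y[k] = sum_i b[i]*x[k-i] - sum_j a[j]*y[k-j]
    (a0 = 1 assumed), truncated to n samples."""
    h = []
    for k in range(n):
        acc = b[k] if k < len(b) else 0.0
        for j, aj in enumerate(a, start=1):
            if k >= j:
                acc -= aj * h[k - j]
        h.append(acc)
    return x_max * sum(abs(v) for v in h)

# hypothetical resonant biquad, for illustration only
b = [0.002, 0.004, 0.002]    # feedforward b0, b1, b2
a = [-1.86, 0.87]            # feedback a1, a2

peak = worst_case_amplitude(b, a, x_max=8_000_000)
bits = math.ceil(math.log2(peak)) + 1   # +1 for the sign bit
print(f"|node| <= {peak:.0f} -> {bits} integer bits")

This absolute-sum (l1) bound is the true worst case for any bounded input;
scaling by the l2 norm or the peak of the frequency response instead gives
smaller headroom at the cost of allowing rare overflows.]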


"Al Clark" <dsp@danvillesignal.com> skrev i meddelandet
news:Xns9660B1FC041E2aclarkdanvillesignal@66.133.129.71...
> "Elektro" <blabla@bredband.net> wrote in > news:4293a2ac$0$79465$14726298@news.sunsite.dk: > > > What's the best way to determine the required number of bits necessary > > in a fixed-point IIR-filter? To avoid overflow during calculations. > > > > > > > > The number of bits needed in an IIR filter depends on many factors > including the coefficients of the filter, the desired performance of the > filter and the implementation of the filter. > > In many cases a basic 16 bit biquad implementation may be fine. In other > cases you may want 32 bit or better math with first order error shaping. > > Here are a couple of guidelines: > > 1. Use cascaded second order sections. > 2. Use a Direct Form I implementation (or sometimes Lattice) First order > error shaping is almost free with this DF I as well. > 3. If you are trying to create low frequency, high Q filters with a high > sample rate, you will need very good precision. > > In most cases, floating point IIRs are not as good as the right fixed > point implementation. Multiplying two floating point numbers is good, but > adding a large float to a small float is bad. > > With fixed point you can take advantage of a large accumulator which is > usually twice the word width plus guard bits, for example 80 bits in a > SHARC. With floating point the result of a multiply is the same length > and precision as the inputs to the multiplier. > > There are some excellent papers published on this topic in the AES > Journal. Mark Allie gave a very good presentation at last years comp.dsp > conference on this topic. > > > > -- > Al Clark > Danville Signal Processing, Inc. > -------------------------------------------------------------------- > Purveyors of Fine DSP Hardware and other Cool Stuff > Available at http://www.danvillesignal.com
Al Clark wrote:
...
> In most cases, floating point IIRs are not as good as the right fixed
> point implementation.
Unless, of course, you use the right floating-point implementation.

Regards,
Andor
"Andor" <an2or@mailcircuit.com> wrote in news:1117700112.101160.67890
@g49g2000cwa.googlegroups.com:

> Al Clark wrote:
> ...
>> In most cases, floating point IIRs are not as good as the right fixed
>> point implementation.
>
> Unless, of course, you use the right floating-point implementation.
>
> Regards,
> Andor
OK, I'll bite, which floating point form?

The problem with the floating point forms is that subtraction and addition
kills you. For example, a big float - another big float = imprecise little
float. This can be minimised to some extent by using a higher precision
float such as the 40 bit version in the SHARC (32 bit mantissa & sign + 8
bit exponent).

The advantage to the fixed point structure is that you have a very large
accumulator to work with. Certainly many filters are fine for a variety of
different requirements. I'm mostly talking about filters where the poles
"kiss" the unit circle.

Mark Allie gave a good presentation on this topic at last year's comp.dsp
conference.

--
Al Clark
Danville Signal Processing, Inc.
--------------------------------------------------------------------
Purveyors of Fine DSP Hardware and other Cool Stuff
Available at http://www.danvillesignal.com
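[Al's big-float-plus-small-float point is easy to demonstrate; a tiny
sketch, where the magnitudes are arbitrary examples:

import numpy as np

big   = np.float32(1.0e8)  # needs ~27 bits; a float32 mantissa has 24
small = np.float32(0.25)   # smaller than the ulp of 'big', which is 8.0

acc = big + small          # 'small' is absorbed completely
print(acc - big)                                   # 0.0
print(np.float64(big) + np.float64(small) - big)   # 0.25 survives in double

A wide fixed-point accumulator would keep the small term exactly in the
same situation, which is the advantage described above.]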
Andor wrote:
> Al Clark wrote:
> ...
>> In most cases, floating point IIRs are not as good as the right fixed
>> point implementation.
>
> Unless, of course, you use the right floating-point implementation.
I think that Al means that in most cases, there will be more significant
bits in "the right" fixed-point implementation. Think of fixed point as
floating point with all the bits used for mantissa and the (common)
exponent stored elsewhere, if only in the programmer's mind.

Jerry
--
Engineering is the art of making what you want from things you can get.
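[Jerry's mental model in two lines of Python - the Q1.31 format here is
just an illustrative choice:

# A Q1.31 fixed-point value is a 32 bit integer whose exponent (2**-31)
# lives "in the programmer's mind" rather than in the stored word:
q31 = int(round(0.62 * 2**31))   # quantize 0.62 to Q1.31
print(q31, q31 / 2**31)          # the stored integer and its meaning]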
Al wrote:

> OK, I'll bite, which floating point form?
I'm not allowed to say which floating-point IIR architecture we (where I work) find gives the best results. Much has been written on this topic, though, and a thorough literature search should result in some pointers.
> The problem with the floating point forms is that subtraction and
> addition kills you. For example, a big float - another big float =
> imprecise little float. This can be minimised to some extent by using a
> higher precision float such as the 40 bit version in the SHARC (32 bit
> mantissa & sign + 8 bit exponent).
I've been thinking about the following little game: you and I each write a
program for the SHARC DSP that computes some specifiable number of IIR
biquads. You use the best available fixed-point implementation, using the
32 bit / 80 bit fixed-point MAC unit. I use our floating-point
implementation using the 32 bit / 40 bit floating-point MAC unit. Each of
us must use 32 bit precision intermediate variables for the filter states
(note that this gives an inherent drawback for floating-point, because 40
bit floating-point filter states could be used at no additional cost in
CPU time, and thus the floating-point performance could be improved for
free - but let's compare apples to apples).

Now, each party gets to choose an arbitrary input signal (we can define
the input format to be 24 bit fixed-point) and an arbitrary set of biquad
specifications. You choose an input signal and some biquad filter
specifications where you think the fixed-point implementation shines and
the floating-point implementation fails miserably, and we each run that
signal through our filter routines and compare the outputs. Likewise, I
get to choose a set of biquads and a test signal, and again we compare the
outputs of our routines.

We then measure the SNR induced by our respective routines (since we know
what the theoretical output should be) and compare the results. I'm pretty
sure I can ruin any fixed-point implementation with the appropriate filter
specifications (using two or perhaps three biquads in series) and input
signal, such that the floating-point implementation will perform
reasonably well with the same test case (although I haven't tried this out
yet). Do you think you could do the same to show the converse :-) ?
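[The scoring step Andor describes - comparing each routine's output with
the theoretical output - might look like the sketch below, where a double
precision run stands in for the theoretical reference. All filter-routine
names are placeholders, not real APIs.

import numpy as np

def snr_db(reference, test):
    """SNR of a routine's output 'test' against a high precision
    'reference', treating their difference as quantization noise."""
    ref = np.asarray(reference, dtype=np.float64)
    noise = np.asarray(test, dtype=np.float64) - ref
    return 10.0 * np.log10(np.sum(ref**2) / np.sum(noise**2))

# Hypothetical harness (placeholder names):
#   x   = make_test_signal()       # 24 bit fixed-point input
#   ref = filter_f64(x, biquads)   # double precision "theoretical" output
#   print("fixed:", snr_db(ref, filter_fixed(x, biquads)), "dB")
#   print("float:", snr_db(ref, filter_float(x, biquads)), "dB")]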
> Mark Allie gave a good presentation on this topic at last year's
> comp.dsp conference.
Yes, I just read his talk on your CD (that was probably the best $25 spent
in my working life!). His discussion of error feedback on DF1 and DF2
biquad architectures is very interesting.

Regards,
Andor
in article 1117724924.118207.181750@g14g2000cwa.googlegroups.com, Andor at
an2or@mailcircuit.com wrote on 06/02/2005 11:08:

> [full quote of Andor's post above snipped]
Andor, with one caveat, I think that Al's fixed-point implementation (with
32-bit fixed-point signals and 80-bit accumulator) would certainly beat
your floating-point implementation (with 32-bit floats and a 40-bit float
accumulator) regarding measurable quantization error. The caveat would be
filters with gains that exceed 0 dB over some range of frequencies, where
the fixed-point would be forced to saturate.

Also, the floating-point *might* be better able to place some of the poles
and zeros, but that is not a signal quantization problem.
>> Mark Allie gave a good presentation on this topic at last year's
>> comp.dsp conference.
>
> Yes, I just read his talk on your CD (that was probably the best $25
> spent in my working life!). His discussion of error feedback on DF1 and
> DF2 biquad architectures is very interesting.
So, even if you're doing floating-point, *don't* use DF2.

--
r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."