
IIR with fixed-point numbers?

Started by Elektro May 24, 2005
What's the best way to determine the number of bits required in a
fixed-point IIR filter to avoid overflow during calculations?


"Elektro" <blabla@bredband.net> wrote in
news:4293a2ac$0$79465$14726298@news.sunsite.dk: 

> What's the best way to determine the number of bits required in a
> fixed-point IIR filter to avoid overflow during calculations?
The number of bits needed in an IIR filter depends on many factors,
including the coefficients of the filter, the desired performance of the
filter, and the implementation of the filter. In many cases a basic 16 bit
biquad implementation may be fine. In other cases you may want 32 bit or
better math with first order error shaping.

Here are a couple of guidelines:

1. Use cascaded second order sections.
2. Use a Direct Form I implementation (or sometimes Lattice). First order
   error shaping is almost free with DF I as well.
3. If you are trying to create low frequency, high Q filters with a high
   sample rate, you will need very good precision.

In most cases, floating point IIRs are not as good as the right fixed
point implementation. Multiplying two floating point numbers is good, but
adding a large float to a small float is bad.

With fixed point you can take advantage of a large accumulator, which is
usually twice the word width plus guard bits, for example 80 bits in a
SHARC. With floating point the result of a multiply is the same length
and precision as the inputs to the multiplier.

There are some excellent papers published on this topic in the AES
Journal. Mark Allie gave a very good presentation at last year's comp.dsp
conference on this topic.

--
Al Clark
Danville Signal Processing, Inc.
--------------------------------------------------------------------
Purveyors of Fine DSP Hardware and other Cool Stuff
Available at http://www.danvillesignal.com
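[To make guideline 2 concrete, here is a minimal Python sketch, not from
the thread itself; the Q2.30 coefficient format and all names are assumed
for illustration. It shows a Direct Form I biquad with first order error
shaping, sometimes called fraction saving: the fraction that truncation
discards is saved and added back into the accumulator on the next sample.

# Sketch: Direct Form I biquad with first order error shaping
# ("fraction saving"). Python's big integers emulate the wide
# accumulator of a fixed-point DSP; Q2.30 coefficients are assumed.

Q = 30  # coefficient fractional bits: coef_int = round(coef * 2**Q)

def df1_biquad(x, b0, b1, b2, a1, a2):
    """Filter integer samples x; coefficients are Q2.30 integers.
    Implements y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2]
                      - a1*y[n-1] - a2*y[n-2]."""
    x1 = x2 = y1 = y2 = 0   # delay line
    err = 0                 # truncation error saved for the next sample
    out = []
    for xn in x:
        # the whole sum stays at full precision, like a wide MAC register
        acc = b0*xn + b1*x1 + b2*x2 - a1*y1 - a2*y2 + err
        yn = acc >> Q              # truncate back to the output width
        err = acc - (yn << Q)      # the fraction just discarded
        x2, x1 = x1, xn
        y2, y1 = y1, yn
        out.append(yn)
    return out

Because the discarded fraction is re-injected, the truncation noise is
shaped away from DC, which is exactly where low frequency, high Q filters
are most sensitive; in DF I the extra addition is nearly free, as Al says.]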
Thank you



But what I'm after is a way to predict how large the numbers inside the
filter are going to get.



For example, if I send in a 40 Hz sine with an amplitude of 8,000,000, it
may generate amplitudes of 3,000,000,000 at some node in the filter.
That's what I'm after: a way to predict how big these numbers can get in
the worst case.


"Al Clark" <dsp@danvillesignal.com> skrev i meddelandet
news:Xns9660B1FC041E2aclarkdanvillesignal@66.133.129.71...
> "Elektro" <blabla@bredband.net> wrote in > news:4293a2ac$0$79465$14726298@news.sunsite.dk: > > > What's the best way to determine the required number of bits necessary > > in a fixed-point IIR-filter? To avoid overflow during calculations. > > > > > > > > The number of bits needed in an IIR filter depends on many factors > including the coefficients of the filter, the desired performance of the > filter and the implementation of the filter. > > In many cases a basic 16 bit biquad implementation may be fine. In other > cases you may want 32 bit or better math with first order error shaping. > > Here are a couple of guidelines: > > 1. Use cascaded second order sections. > 2. Use a Direct Form I implementation (or sometimes Lattice) First order > error shaping is almost free with this DF I as well. > 3. If you are trying to create low frequency, high Q filters with a high > sample rate, you will need very good precision. > > In most cases, floating point IIRs are not as good as the right fixed > point implementation. Multiplying two floating point numbers is good, but > adding a large float to a small float is bad. > > With fixed point you can take advantage of a large accumulator which is > usually twice the word width plus guard bits, for example 80 bits in a > SHARC. With floating point the result of a multiply is the same length > and precision as the inputs to the multiplier. > > There are some excellent papers published on this topic in the AES > Journal. Mark Allie gave a very good presentation at last years comp.dsp > conference on this topic. > > > > -- > Al Clark > Danville Signal Processing, Inc. > -------------------------------------------------------------------- > Purveyors of Fine DSP Hardware and other Cool Stuff > Available at http://www.danvillesignal.com
Elektro wrote:
> Thank you
>
> But what I'm after is a way to predict how large the numbers inside the
> filter are going to get.
>
> For example, if I send in a 40 Hz sine with an amplitude of 8,000,000,
> it may generate amplitudes of 3,000,000,000 at some node in the filter.
> That's what I'm after: a way to predict how big these numbers can get in
> the worst case.
You can be certain that they will be bigger than is manageable unless you
scale your quantities initially, and probably as the calculation proceeds.
What you propose to do needs techniques that aren't obvious to people who
think of arithmetic in paper-and-pencil terms. Floating point needs only a
small departure from that mindset, but fixed point (scaled integer) is
almost a different realm. It requires logic beyond simple arithmetic, and
benefits from art.

Randy Yates has written an excellent tutorial that lays out the groundwork
and supplies some of the art.

Jerry
--
Engineering is the art of making what you want from things you can get.
Hello again



Now I have found a way to bound the intermediate results in the IIR filter
and so avoid overflow.



The method was in fact simple: compute the impulse response from the
filter's input to the node in question, take the sum of its absolute
values, and multiply that sum by the maximum input amplitude.



Then you know that the signal at that node never exceeds that number.
After that it's easy to calculate the required number of integer bits.



This method was described in the paper 'Determining appropriate precisions
for signals in fixed-point IIR filters'.
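[As a sketch of that calculation in Python - the biquad coefficients and
the truncation length below are made-up examples, and truncating the
impulse response is only safe once it has decayed to a negligible level:

import math

def worst_case_amplitude(b, a, x_max, n=100_000):
    """Bound |signal| at a node by x_max * sum(|h[k]|), where h is the
    impulse response of y[k] = sum_i b[i]*x[k-i] - sum_j a[j]*y[k-j]
    (a0 = 1 assumed), truncated to n samples."""
    h = []
    for k in range(n):
        acc = b[k] if k < len(b) else 0.0
        for j, aj in enumerate(a, start=1):
            if k >= j:
                acc -= aj * h[k - j]
        h.append(acc)
    return x_max * sum(abs(v) for v in h)

# hypothetical resonant biquad, for illustration only
b = [0.002, 0.004, 0.002]    # feedforward b0, b1, b2
a = [-1.86, 0.87]            # feedback a1, a2

peak = worst_case_amplitude(b, a, x_max=8_000_000)
bits = math.ceil(math.log2(peak)) + 1   # +1 for the sign bit
print(f"|node| <= {peak:.0f} -> {bits} integer bits")

This absolute-sum (l1) bound is the true worst case for any bounded input;
scaling by the l2 norm or the peak of the frequency response instead gives
smaller headroom at the cost of allowing rare overflows.]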


"Al Clark" <dsp@danvillesignal.com> skrev i meddelandet
news:Xns9660B1FC041E2aclarkdanvillesignal@66.133.129.71...
> "Elektro" <blabla@bredband.net> wrote in > news:4293a2ac$0$79465$14726298@news.sunsite.dk: > > > What's the best way to determine the required number of bits necessary > > in a fixed-point IIR-filter? To avoid overflow during calculations. > > > > > > > > The number of bits needed in an IIR filter depends on many factors > including the coefficients of the filter, the desired performance of the > filter and the implementation of the filter. > > In many cases a basic 16 bit biquad implementation may be fine. In other > cases you may want 32 bit or better math with first order error shaping. > > Here are a couple of guidelines: > > 1. Use cascaded second order sections. > 2. Use a Direct Form I implementation (or sometimes Lattice) First order > error shaping is almost free with this DF I as well. > 3. If you are trying to create low frequency, high Q filters with a high > sample rate, you will need very good precision. > > In most cases, floating point IIRs are not as good as the right fixed > point implementation. Multiplying two floating point numbers is good, but > adding a large float to a small float is bad. > > With fixed point you can take advantage of a large accumulator which is > usually twice the word width plus guard bits, for example 80 bits in a > SHARC. With floating point the result of a multiply is the same length > and precision as the inputs to the multiplier. > > There are some excellent papers published on this topic in the AES > Journal. Mark Allie gave a very good presentation at last years comp.dsp > conference on this topic. > > > > -- > Al Clark > Danville Signal Processing, Inc. > -------------------------------------------------------------------- > Purveyors of Fine DSP Hardware and other Cool Stuff > Available at http://www.danvillesignal.com
Al Clark wrote:
...
> In most cases, floating point IIRs are not as good as the right fixed
> point implementation.
Unless, of course, you use the right floating-point implementation.

Regards,
Andor
"Andor" <an2or@mailcircuit.com> wrote in news:1117700112.101160.67890
@g49g2000cwa.googlegroups.com:

> Al Clark wrote:
> ...
>> In most cases, floating point IIRs are not as good as the right fixed
>> point implementation.
>
> Unless, of course, you use the right floating-point implementation.
>
> Regards,
> Andor
OK, I'll bite, which floating point form?

The problem with the floating point forms is that subtraction and addition
kills you. For example, a big float - another big float = imprecise little
float. This can be minimised to some extent by using a higher precision
float such as the 40 bit version in the SHARC (32 bit mantissa & sign + 8
bit exponent).

The advantage to the fixed point structure is that you have a very large
accumulator to work with. Certainly many filters are fine for a variety of
different requirements. I'm mostly talking about filters where the poles
"kiss" the unit circle.

Mark Allie gave a good presentation on this topic at last year's comp.dsp
conference.

--
Al Clark
Danville Signal Processing, Inc.
--------------------------------------------------------------------
Purveyors of Fine DSP Hardware and other Cool Stuff
Available at http://www.danvillesignal.com
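[Al's big-float-plus-small-float point is easy to demonstrate; a tiny
sketch, where the magnitudes are arbitrary examples:

import numpy as np

big   = np.float32(1.0e8)  # needs ~27 bits; a float32 mantissa has 24
small = np.float32(0.25)   # smaller than the ulp of 'big', which is 8.0

acc = big + small          # 'small' is absorbed completely
print(acc - big)                                   # 0.0
print(np.float64(big) + np.float64(small) - big)   # 0.25 survives in double

A wide fixed-point accumulator would keep the small term exactly in the
same situation, which is the advantage described above.]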
Andor wrote:
> Al Clark wrote:
> ...
>> In most cases, floating point IIRs are not as good as the right fixed
>> point implementation.
>
> Unless, of course, you use the right floating-point implementation.
I think that Al means that in most cases, there will be more significant
bits in "the right" fixed-point implementation. Think of fixed point as
floating point with all the bits used for mantissa and the (common)
exponent stored elsewhere, if only in the programmer's mind.

Jerry
--
Engineering is the art of making what you want from things you can get.
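[Jerry's mental model in two lines of Python - the Q1.31 format here is
just an illustrative choice:

# A Q1.31 fixed-point value is a 32 bit integer whose exponent (2**-31)
# lives "in the programmer's mind" rather than in the stored word:
q31 = int(round(0.62 * 2**31))   # quantize 0.62 to Q1.31
print(q31, q31 / 2**31)          # the stored integer and its meaning]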
Al wrote:

> OK, I'll bite, which floating point form?
I'm not allowed to say which floating-point IIR architecture we (where I work) find gives the best results. Much has been written on this topic, though, and a thorough literature search should result in some pointers.
> The problem with the floating point forms is that subtraction and
> addition kills you. For example, a big float - another big float =
> imprecise little float. This can be minimised to some extent by using a
> higher precision float such as the 40 bit version in the SHARC (32 bit
> mantissa & sign + 8 bit exponent).
I've been thinking about the following little game: you and I each write a
program for the SHARC DSP that computes some specifiable number of IIR
biquads. You use the best available fixed-point implementation, using the
32 bit / 80 bit fixed-point MAC unit. I use our floating-point
implementation using the 32 bit / 40 bit floating-point MAC unit. Each of
us must use 32 bit precision intermediate variables for the filter states
(note that this gives an inherent drawback for floating-point, because 40
bit floating-point filter states could be used at no additional cost in
CPU time, and thus the floating-point performance could be improved for
free - but let's compare apples to apples).

Now, each party gets to choose an arbitrary input signal (we can define
the input format to be 24 bit fixed-point) and an arbitrary set of biquad
specifications. You choose an input signal and some biquad filter
specifications where you think the fixed-point implementation shines and
the floating-point implementation fails miserably, and we each run that
signal through our filter routines and compare the outputs. Likewise, I
get to choose a set of biquads and a test signal, and again we compare the
outputs of our routines.

We then measure the SNR induced by our respective routines (since we know
what the theoretical output should be) and compare the results. I'm pretty
sure I can ruin any fixed-point implementation with the appropriate filter
specifications (using two or perhaps three biquads in series) and input
signal, such that the floating-point implementation will perform
reasonably well with the same test case (although I haven't tried this out
yet). Do you think you could do the same to show the converse :-) ?
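[The scoring step Andor describes - comparing each routine's output with
the theoretical output - might look like the sketch below, where a double
precision run stands in for the theoretical reference. All filter-routine
names are placeholders, not real APIs.

import numpy as np

def snr_db(reference, test):
    """SNR of a routine's output 'test' against a high precision
    'reference', treating their difference as quantization noise."""
    ref = np.asarray(reference, dtype=np.float64)
    noise = np.asarray(test, dtype=np.float64) - ref
    return 10.0 * np.log10(np.sum(ref**2) / np.sum(noise**2))

# Hypothetical harness (placeholder names):
#   x   = make_test_signal()       # 24 bit fixed-point input
#   ref = filter_f64(x, biquads)   # double precision "theoretical" output
#   print("fixed:", snr_db(ref, filter_fixed(x, biquads)), "dB")
#   print("float:", snr_db(ref, filter_float(x, biquads)), "dB")]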
> Mark Allie gave a good presentation on this topic at last year's
> comp.dsp conference.
Yes, I just read his talk on your CD (that was probably the best $25 spent
in my working life!). His discussion of error feedback on DF1 and DF2
biquad architectures is very interesting.

Regards,
Andor
in article 1117724924.118207.181750@g14g2000cwa.googlegroups.com, Andor at
an2or@mailcircuit.com wrote on 06/02/2005 11:08:

> [full quote of Andor's post above snipped]
Andor, with one caveat, I think that Al's fixed-point implementation (with
32-bit fixed-point signals and 80-bit accumulator) would certainly beat
your floating-point implementation (with 32-bit floats and a 40-bit float
accumulator) regarding measurable quantization error. The caveat would be
filters with gains that exceed 0 dB over some range of frequencies, where
the fixed-point would be forced to saturate.

Also, the floating-point *might* be better able to place some of the poles
and zeros, but that is not a signal quantization problem.
>> Mark Allie gave a good presentation on this topic at last year's
>> comp.dsp conference.
>
> Yes, I just read his talk on your CD (that was probably the best $25
> spent in my working life!). His discussion of error feedback on DF1 and
> DF2 biquad architectures is very interesting.
So, even if you're doing floating-point, *don't* use DF2.

--
r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."