# saturation in iir filter

Started by April 1, 2008
```Hi all,

i am currently implementing an iir filter on a arm processor that doesn't
have a fractionnal multiplication.
As i have read on this forum i divided my filter coefficient (bo b1 b2 a1
a2) by 2 if they are >1 to be able to perform a q1.15 by q1.15
multiplication
the results of multiplications are therefore q2.30.
I accumulate those multiplication in a 64 bit register , multiply the
result by 2 to compensate for coeffs normalization

ie: accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] +
a2\2*y[n-2]
accum += accum; (to compensate for dividing by 2 the coefficient)

then i round and saturate the result the following way:
if accum > 0x3fffffff
result = 0x7fff (going back to a 16 bit value)
else if accum < 0x80000000
result = 0x8000
else
result = (accu +(1<<15)) >>15 (rouding and going back from q2.30 to
q1.15)

i am then storing this result in the memory line as y[n-1]

my question is the following: my input signal has a very big dynamic range
(many samples above 30000 or below -30000) and when i am filtering it with
a high pass filter (fc = 300hz) i get an output signal with a lot of
saturation which produces hearable artifacts: i am doing something wrong
in my calculation or is the only solution to scale down the input signal
or to divided the b's coefficient by some constant to avoid saturation?

read on this forum on the subject :-) )

```
```
marianne wrote:
> Hi all,
>
> i am currently implementing an iir filter on a arm processor that doesn't
> have a fractionnal multiplication.
> As i have read on this forum i divided my filter coefficient (bo b1 b2 a1
> a2) by 2 if they are >1 to be able to perform a q1.15 by q1.15
> multiplication
> the results of multiplications are therefore q2.30.
> I accumulate those multiplication in a 64 bit register , multiply the
> result by 2 to compensate for coeffs normalization
>
> ie: accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] +
> a2\2*y[n-2]

So the accum is computed as the 32 bit number and then casted to 64
bits, huh.

> accum += accum; (to compensate for dividing by 2 the coefficient)
>
> then i round and saturate the result the following way:
> if accum > 0x3fffffff
^^^^^^^^^^^^^^^^^^^^^^^^

?????

if(accum > 0x80000000)

>  result = 0x7fff (going back to a 16 bit value)
> else if accum < 0x80000000
>  result = 0x8000

You need to try harder.

DSP and Mixed Signal Design Consultant
http://www.abvolt.com
```
```Hello Marianne

Please give the filter coefficients also. So we
can see that the coefficients of the feedback
section is implemented correctly.

I suspect that there is a sign error while implementing
feedback coefficients, and that might be causing multiple
rollovers.

Regards
Bharat Pathak

Arithos Designs
www.Arithos.com

```
```Sorry i did many typing errors:

my accumulation equation is
accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] -
a2\2*y[n-2]

(instead of the +a2\2*y[n-2] i typed before)

and in my particular case coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

another typing error is that i accumulate on a 32 bits register as the
q2.30 format is enough to take the temporary one bit growth that sometimes
happen during accumulation.

and finally i multiply the accum value by 2 (accum = accum + accum) and
then compare it to the max values of my q2.30 format ie 0x3fffffff and
0xC0000000 to check if i have an overflow or not before  rouding and
scaling down to q1.15: ie
if (accum > 0x3fffffff) then i saturate my 16bits result to 0x7fff
if (accum < 0xC0000000) then i saturate my 16bits result to 0x8000
else i round it and shift it by 15 to get back to a q1.15

maybe Vladirmir that's where i got it wrong? for me the accumulation was
still in the q2.30 format.
```
```Sorry i did many typing errors:

my accumulation equation is
accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] -
a2\2*y[n-2]

(instead of the +a2\2*y[n-2] i typed before)

and in my particular case coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

another typing error is that i accumulate on a 32 bits register as the
q2.30 format is enough to take the temporary one bit growth that sometimes
happen during accumulation.

and finally i multiply the accum value by 2 (accum = accum + accum) and
then compare it to the max values of my q2.30 format ie 0x3fffffff and
0xC0000000 to check if i have an overflow or not before  rouding and
scaling down to q1.15: ie
if (accum > 0x3fffffff) then i saturate my 16bits result to 0x7fff
if (accum < 0xC0000000) then i saturate my 16bits result to 0x8000
else i round it and shift it by 15 to get back to a q1.15

maybe Vladirmir that's where i got it wrong? for me the accumulation was
still in the q2.30 format.
```
```Sorry i did many typing errors:

my accumulation equation is
accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] -
a2\2*y[n-2]

(instead of the +a2\2*y[n-2] i typed before)

and in my particular case coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

another typing error is that i accumulate on a 32 bits register as the
q2.30 format is enough to take the temporary one bit growth that sometimes
happen during accumulation.

and finally i multiply the accum value by 2 (accum = accum + accum) and
then compare it to the max values of my q2.30 format ie 0x3fffffff and
0xC0000000 to check if i have an overflow or not before  rouding and
scaling down to q1.15: ie
if (accum > 0x3fffffff) then i saturate my 16bits result to 0x7fff
if (accum < 0xC0000000) then i saturate my 16bits result to 0x8000
else i round it and shift it by 15 to get back to a q1.15

maybe Vladirmir that's where i got it wrong? for me the accumulation was
still in the q2.30 format.
```
```Hi marianne,

"marianne" <mary1331@hotmail.com> writes:
> [...]
> and finally i multiply the accum value by 2 (accum = accum + accum)

If your machine doesn't have a saturation mode, then this step is
absolutely pointless and potentially harmful. You never want to shift a
final result left (which is what multiplying by 2 does) since you risk
overflowing and there is no reason to do it.

Also, depending on the reason for the shift, you either a) change the
scaling of the result, or b) change the gain of the overall filter. You
almost certainly don't want to do either of these at this stage.
Additionally, if you did want to perform b) for some reason, you
can perform the function "virtually" by simply reinterpreting the
scaling without risking an overflow.

> and then compare it to the max values of my q2.30 format ie 0x3fffffff
> and 0xC0000000 to check if i have an overflow or not before rouding
> and scaling down to q1.15: ie if (accum > 0x3fffffff) then i saturate
> my 16bits result to 0x7fff if (accum < 0xC0000000) then i saturate my
> 16bits result to 0x8000 else i round it and shift it by 15 to get back
> to a q1.15

What I would do, instead, is first shift the accumulator right by 15
(since this changes the scaling to q17.15), then perform your saturation
logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
round, add 0x00004000 to the accumulator before shifting right by 15.
If you want to protect the round operation from overflow, shift the
accumulator right 1 first, then add 0x00002000, and then shift right by
14 and perform the saturation logic.

This is assuming your machine doesn't have built-in saturation
abilities. If it did, then you would do things quite a bit differently
in order to make use of them and save some cycles.
--
%  Randy Yates                  % "Midnight, on the water...
%% Fuquay-Varina, NC            %  I saw...  the ocean's daughter."
%%% 919-577-9882                % 'Can't Get It Out Of My Head'
%%%% <yates@ieee.org>           % *El Dorado*, Electric Light Orchestra
http://www.digitalsignallabs.com
```
```Thanks for your answer Randy

>If your machine doesn't have a saturation mode, then this step is
>absolutely pointless and potentially harmful. You never want to shift a
>final result left (which is what multiplying by 2 does) since you risk
>overflowing and there is no reason to do it.

my machine doesn't have  a saturation mode.
I was doing this multiplication by two of my accumulator because i read
the following from rbj on this forum on this issue of having coeffs out of
the range [-1 1[

"""""""""""""""""""""""""""""""""""""""""""""""""""""
y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]

you may not scale the a1 or a2 coefficients according to any other
criteria.
you *may* scale the numerator coefficients (b0, b1, b2) by a common
factor
and, except for quantizing or overflow issues, that will change only the
level of the output.

now since a1 has a range that exceeds what you can put in a 1.15 number
format, then what is usually done is that all 5 coefficients are
represented
with 1/2 of their correct value and the resulting sum is doubled
(sometimes
with an ASL instruction) before it is outputed and saved as a state. that
is:

y[n] = 2*(b0/2*x[n] + b1/2*x[n-1] + b2/2*x[n-2] - a1/2*y[n-1] -
a2/2*y[n-2])

so the coefs you put into the DSP are b0/2, b1/2, b2/2, a1/2, a2/2, and
the
result of the sum is 1/2 of what the output is until you multiply that by
2
(usually with an ASL instruction). this is not solely a scaling issue
because if you don't double the resulting sum, the y[n-1] and y[n-2]
states
will not be correct and your filter will not have the shape that you
designed for.
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

I do fully agree that multiplicating the accumulation result by 2 will
cause some overflow but from what RBJ wrote i thought i had to this to
keep the shape of my filter as i intended to be when calculating the
coefficient s at the first place.

One of my point was also not to loose any dynamic in my output signal
compares to the dynamic of my input signal but then should i understand it
is just not possible to save the dynamic range AND to avoid overflow?

many thanks again for your help

```
```"marianne" <mary1331@hotmail.com> writes:

You're very welcome, marianne.

>>If your machine doesn't have a saturation mode, then this step is
>>absolutely pointless and potentially harmful. You never want to shift a
>>final result left (which is what multiplying by 2 does) since you risk
>>overflowing and there is no reason to do it.
>
> my machine doesn't have  a saturation mode.
> I was doing this multiplication by two of my accumulator because i read
> the following from rbj on this forum on this issue of having coeffs out of
> the range [-1 1[
>
> """""""""""""""""""""""""""""""""""""""""""""""""""""
> y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
>
>
> you may not scale the a1 or a2 coefficients according to any other
> criteria.
> you *may* scale the numerator coefficients (b0, b1, b2) by a common
> factor
> and, except for quantizing or overflow issues, that will change only the
> level of the output.
>
> now since a1 has a range that exceeds what you can put in a 1.15 number
> format, then what is usually done is that all 5 coefficients are
> represented
> with 1/2 of their correct value and the resulting sum is doubled
> (sometimes
> with an ASL instruction) before it is outputed and saved as a state. that
> is:
>
>
> y[n] = 2*(b0/2*x[n] + b1/2*x[n-1] + b2/2*x[n-2] - a1/2*y[n-1] -
> a2/2*y[n-2])
>
>
> so the coefs you put into the DSP are b0/2, b1/2, b2/2, a1/2, a2/2, and
> the
> result of the sum is 1/2 of what the output is until you multiply that by
> 2
> (usually with an ASL instruction). this is not solely a scaling issue
> because if you don't double the resulting sum, the y[n-1] and y[n-2]
> states
> will not be correct and your filter will not have the shape that you
> designed for.
>""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Aha!

> I do fully agree that multiplicating the accumulation result by 2 will
> cause some overflow but from what RBJ wrote i thought i had to this to
> keep the shape of my filter as i intended to be when calculating the
> coefficient s at the first place.

OK, I see now. Then you do what I mentioned previously: multiply by two
"virtually":

If the accumulator result is q2.30, then to multiply by two, simply
reinterpret the scaling as q3.29. you don't really do anything
physically, it's just a reinterpretation. As a simple example, let's
say the accumulator result was "1.0", or 0x40000000 in q(2,30).
But 0x4000000 in q(3,29) is "2.0". Voila: we've multiplied by two,
but we didn't overflow anything or lose any precision.

What I would do, then, is first shift the accumulator right by 14
(since this changes the scaling to q17.15), then perform your saturation
logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
round, add 0x00002000 to the accumulator before shifting right by 14.
If you want to protect the round operation from overflow, shift the
accumulator right 1 first, then add 0x00001000, and then shift right by
13 and perform the saturation logic.

> One of my point was also not to loose any dynamic in my output signal
> compares to the dynamic of my input signal but then should i understand it
> is just not possible to save the dynamic range AND to avoid overflow?

That may be true - I'd have to examine your filter coefficients more
closely to say for sure.

But, as a simple thought experiment, consider a very simple 1-tap FIR
filter h[0] = 2.0 (i.e., a gain of two). Then if you want the output
samples to be scaled the same as the input samples, you're going to
necessarily lose dynamic range, since many samples in your input signal
that were originally in range will saturate the "filter" output.

A similar thing goes for your IIR. If, for example, your IIR has a gain
of > 1 in some band, and you input a full-scale sine wave in that band,
then it's going to clip (saturate) at the output, assuming you keep the
scaling of the output samples the same as that of the input samples.
--
%  Randy Yates                  % "I met someone who looks alot like you,
%% Fuquay-Varina, NC            %             she does the things you do,
%%% 919-577-9882                %                     but she is an IBM."
%%%% <yates@ieee.org>           %        'Yours Truly, 2095', *Time*, ELO
http://www.digitalsignallabs.com
```
```>If the accumulator result is q2.30, then to multiply by two, simply
>reinterpret the scaling as q3.29. you don't really do anything
>physically, it's just a reinterpretation. As a simple example, let's
>say the accumulator result was "1.0", or 0x40000000 in q(2,30).
>But 0x4000000 in q(3,29) is "2.0". Voila: we've multiplied by two,
>but we didn't overflow anything or lose any precision.
>
>What I would do, then, is first shift the accumulator right by 14
>(since this changes the scaling to q17.15), then perform your saturation
>logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
>bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
>round, add 0x00002000 to the accumulator before shifting right by 14.
>If you want to protect the round operation from overflow, shift the
>accumulator right 1 first, then add 0x00001000, and then shift right by
>13 and perform the saturation logic.

Let's take an example that really happen in my filter:
my accumulator value (in q2.30 before considering it as q3.29) is
0x21ca49a7: it is not over the max of q2.30, fine.
Now i do what you suggest above add 0x00002000 -> 0x21ca69a7
shift rigth by 14 -> 0x00008729 is over 0x00007fff so my q1.15 output
result is 0x7fff => ok i am saturating

>That may be true - I'd have to examine your filter coefficients more
>closely to say for sure.

the coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

>
>But, as a simple thought experiment, consider a very simple 1-tap FIR
>filter h[0] = 2.0 (i.e., a gain of two). Then if you want the output
>samples to be scaled the same as the input samples, you're going to
>necessarily lose dynamic range, since many samples in your input signal
>that were originally in range will saturate the "filter" output.
>
>A similar thing goes for your IIR. If, for example, your IIR has a gain
>of > 1 in some band, and you input a full-scale sine wave in that band,
>then it's going to clip (saturate) at the output, assuming you keep the
>scaling of the output samples the same as that of the input samples.

I agree with that idea but in my present case, my filter is a high pass
filter and it doesn't seem to me that the gain is ever > 1 (i might be
wrong on this)
that's where my un-understanding lies: how can i saturate if my filter
gain is never >1...
```