# saturation in iir filter

Started by April 1, 2008
```Hi all,

i am currently implementing an iir filter on a arm processor that doesn't
have a fractionnal multiplication.
As i have read on this forum i divided my filter coefficient (bo b1 b2 a1
a2) by 2 if they are >1 to be able to perform a q1.15 by q1.15
multiplication
the results of multiplications are therefore q2.30.
I accumulate those multiplication in a 64 bit register , multiply the
result by 2 to compensate for coeffs normalization

ie: accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] +
a2\2*y[n-2]
accum += accum; (to compensate for dividing by 2 the coefficient)

then i round and saturate the result the following way:
if accum > 0x3fffffff
result = 0x7fff (going back to a 16 bit value)
else if accum < 0x80000000
result = 0x8000
else
result = (accu +(1<<15)) >>15 (rouding and going back from q2.30 to
q1.15)

i am then storing this result in the memory line as y[n-1]

my question is the following: my input signal has a very big dynamic range
(many samples above 30000 or below -30000) and when i am filtering it with
a high pass filter (fc = 300hz) i get an output signal with a lot of
saturation which produces hearable artifacts: i am doing something wrong
in my calculation or is the only solution to scale down the input signal
or to divided the b's coefficient by some constant to avoid saturation?

read on this forum on the subject :-) )

```
```
marianne wrote:
> Hi all,
>
> i am currently implementing an iir filter on a arm processor that doesn't
> have a fractionnal multiplication.
> As i have read on this forum i divided my filter coefficient (bo b1 b2 a1
> a2) by 2 if they are >1 to be able to perform a q1.15 by q1.15
> multiplication
> the results of multiplications are therefore q2.30.
> I accumulate those multiplication in a 64 bit register , multiply the
> result by 2 to compensate for coeffs normalization
>
> ie: accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] +
> a2\2*y[n-2]

So the accum is computed as the 32 bit number and then casted to 64
bits, huh.

> accum += accum; (to compensate for dividing by 2 the coefficient)
>
> then i round and saturate the result the following way:
> if accum > 0x3fffffff
^^^^^^^^^^^^^^^^^^^^^^^^

?????

if(accum > 0x80000000)

>  result = 0x7fff (going back to a 16 bit value)
> else if accum < 0x80000000
>  result = 0x8000

You need to try harder.

DSP and Mixed Signal Design Consultant
http://www.abvolt.com
```
```Hello Marianne

Please give the filter coefficients also. So we
can see that the coefficients of the feedback
section is implemented correctly.

I suspect that there is a sign error while implementing
feedback coefficients, and that might be causing multiple
rollovers.

Regards
Bharat Pathak

Arithos Designs
www.Arithos.com

```
```Sorry i did many typing errors:

my accumulation equation is
accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] -
a2\2*y[n-2]

(instead of the +a2\2*y[n-2] i typed before)

and in my particular case coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

another typing error is that i accumulate on a 32 bits register as the
q2.30 format is enough to take the temporary one bit growth that sometimes
happen during accumulation.

and finally i multiply the accum value by 2 (accum = accum + accum) and
then compare it to the max values of my q2.30 format ie 0x3fffffff and
0xC0000000 to check if i have an overflow or not before  rouding and
scaling down to q1.15: ie
if (accum > 0x3fffffff) then i saturate my 16bits result to 0x7fff
if (accum < 0xC0000000) then i saturate my 16bits result to 0x8000
else i round it and shift it by 15 to get back to a q1.15

maybe Vladirmir that's where i got it wrong? for me the accumulation was
still in the q2.30 format.
```
```Sorry i did many typing errors:

my accumulation equation is
accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] -
a2\2*y[n-2]

(instead of the +a2\2*y[n-2] i typed before)

and in my particular case coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

another typing error is that i accumulate on a 32 bits register as the
q2.30 format is enough to take the temporary one bit growth that sometimes
happen during accumulation.

and finally i multiply the accum value by 2 (accum = accum + accum) and
then compare it to the max values of my q2.30 format ie 0x3fffffff and
0xC0000000 to check if i have an overflow or not before  rouding and
scaling down to q1.15: ie
if (accum > 0x3fffffff) then i saturate my 16bits result to 0x7fff
if (accum < 0xC0000000) then i saturate my 16bits result to 0x8000
else i round it and shift it by 15 to get back to a q1.15

maybe Vladirmir that's where i got it wrong? for me the accumulation was
still in the q2.30 format.
```
```Sorry i did many typing errors:

my accumulation equation is
accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] -
a2\2*y[n-2]

(instead of the +a2\2*y[n-2] i typed before)

and in my particular case coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

another typing error is that i accumulate on a 32 bits register as the
q2.30 format is enough to take the temporary one bit growth that sometimes
happen during accumulation.

and finally i multiply the accum value by 2 (accum = accum + accum) and
then compare it to the max values of my q2.30 format ie 0x3fffffff and
0xC0000000 to check if i have an overflow or not before  rouding and
scaling down to q1.15: ie
if (accum > 0x3fffffff) then i saturate my 16bits result to 0x7fff
if (accum < 0xC0000000) then i saturate my 16bits result to 0x8000
else i round it and shift it by 15 to get back to a q1.15

maybe Vladirmir that's where i got it wrong? for me the accumulation was
still in the q2.30 format.
```
```Hi marianne,

"marianne" <mary1331@hotmail.com> writes:
> [...]
> and finally i multiply the accum value by 2 (accum = accum + accum)

If your machine doesn't have a saturation mode, then this step is
absolutely pointless and potentially harmful. You never want to shift a
final result left (which is what multiplying by 2 does) since you risk
overflowing and there is no reason to do it.

Also, depending on the reason for the shift, you either a) change the
scaling of the result, or b) change the gain of the overall filter. You
almost certainly don't want to do either of these at this stage.
Additionally, if you did want to perform b) for some reason, you
can perform the function "virtually" by simply reinterpreting the
scaling without risking an overflow.

> and then compare it to the max values of my q2.30 format ie 0x3fffffff
> and 0xC0000000 to check if i have an overflow or not before rouding
> and scaling down to q1.15: ie if (accum > 0x3fffffff) then i saturate
> my 16bits result to 0x7fff if (accum < 0xC0000000) then i saturate my
> 16bits result to 0x8000 else i round it and shift it by 15 to get back
> to a q1.15

What I would do, instead, is first shift the accumulator right by 15
(since this changes the scaling to q17.15), then perform your saturation
logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
round, add 0x00004000 to the accumulator before shifting right by 15.
If you want to protect the round operation from overflow, shift the
accumulator right 1 first, then add 0x00002000, and then shift right by
14 and perform the saturation logic.

This is assuming your machine doesn't have built-in saturation
abilities. If it did, then you would do things quite a bit differently
in order to make use of them and save some cycles.
--
%  Randy Yates                  % "Midnight, on the water...
%% Fuquay-Varina, NC            %  I saw...  the ocean's daughter."
%%% 919-577-9882                % 'Can't Get It Out Of My Head'
%%%% <yates@ieee.org>           % *El Dorado*, Electric Light Orchestra
http://www.digitalsignallabs.com
```
```Thanks for your answer Randy

>If your machine doesn't have a saturation mode, then this step is
>absolutely pointless and potentially harmful. You never want to shift a
>final result left (which is what multiplying by 2 does) since you risk
>overflowing and there is no reason to do it.

my machine doesn't have  a saturation mode.
I was doing this multiplication by two of my accumulator because i read
the following from rbj on this forum on this issue of having coeffs out of
the range [-1 1[

"""""""""""""""""""""""""""""""""""""""""""""""""""""
y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]

you may not scale the a1 or a2 coefficients according to any other
criteria.
you *may* scale the numerator coefficients (b0, b1, b2) by a common
factor
and, except for quantizing or overflow issues, that will change only the
level of the output.

now since a1 has a range that exceeds what you can put in a 1.15 number
format, then what is usually done is that all 5 coefficients are
represented
with 1/2 of their correct value and the resulting sum is doubled
(sometimes
with an ASL instruction) before it is outputed and saved as a state. that
is:

y[n] = 2*(b0/2*x[n] + b1/2*x[n-1] + b2/2*x[n-2] - a1/2*y[n-1] -
a2/2*y[n-2])

so the coefs you put into the DSP are b0/2, b1/2, b2/2, a1/2, a2/2, and
the
result of the sum is 1/2 of what the output is until you multiply that by
2
(usually with an ASL instruction). this is not solely a scaling issue
because if you don't double the resulting sum, the y[n-1] and y[n-2]
states
will not be correct and your filter will not have the shape that you
designed for.
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

I do fully agree that multiplicating the accumulation result by 2 will
cause some overflow but from what RBJ wrote i thought i had to this to
keep the shape of my filter as i intended to be when calculating the
coefficient s at the first place.

One of my point was also not to loose any dynamic in my output signal
compares to the dynamic of my input signal but then should i understand it
is just not possible to save the dynamic range AND to avoid overflow?

many thanks again for your help

```
```"marianne" <mary1331@hotmail.com> writes:

You're very welcome, marianne.

>>If your machine doesn't have a saturation mode, then this step is
>>absolutely pointless and potentially harmful. You never want to shift a
>>final result left (which is what multiplying by 2 does) since you risk
>>overflowing and there is no reason to do it.
>
> my machine doesn't have  a saturation mode.
> I was doing this multiplication by two of my accumulator because i read
> the following from rbj on this forum on this issue of having coeffs out of
> the range [-1 1[
>
>
"""""""""""""""""""""""""""""""""""""""""""""""""""""
> y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
>
>
> you may not scale the a1 or a2 coefficients according to any other
> criteria.
> you *may* scale the numerator coefficients (b0, b1, b2) by a common
> factor
> and, except for quantizing or overflow issues, that will change only the
> level of the output.
>
> now since a1 has a range that exceeds what you can put in a 1.15 number
> format, then what is usually done is that all 5 coefficients are
> represented
> with 1/2 of their correct value and the resulting sum is doubled
> (sometimes
> with an ASL instruction) before it is outputed and saved as a state. that
> is:
>
>
> y[n] = 2*(b0/2*x[n] + b1/2*x[n-1] + b2/2*x[n-2] - a1/2*y[n-1] -
> a2/2*y[n-2])
>
>
> so the coefs you put into the DSP are b0/2, b1/2, b2/2, a1/2, a2/2, and
> the
> result of the sum is 1/2 of what the output is until you multiply that by
> 2
> (usually with an ASL instruction). this is not solely a scaling issue
> because if you don't double the resulting sum, the y[n-1] and y[n-2]
> states
> will not be correct and your filter will not have the shape that you
> designed for.
>""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Aha!

> I do fully agree that multiplicating the accumulation result by 2 will
> cause some overflow but from what RBJ wrote i thought i had to this to
> keep the shape of my filter as i intended to be when calculating the
> coefficient s at the first place.

OK, I see now. Then you do what I mentioned previously: multiply by two
"virtually":

If the accumulator result is q2.30, then to multiply by two, simply
reinterpret the scaling as q3.29. you don't really do anything
physically, it's just a reinterpretation. As a simple example, let's
say the accumulator result was "1.0", or 0x40000000 in q(2,30).
But 0x4000000 in q(3,29) is "2.0". Voila: we've multiplied by two,
but we didn't overflow anything or lose any precision.

What I would do, then, is first shift the accumulator right by 14
(since this changes the scaling to q17.15), then perform your saturation
logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
round, add 0x00002000 to the accumulator before shifting right by 14.
If you want to protect the round operation from overflow, shift the
accumulator right 1 first, then add 0x00001000, and then shift right by
13 and perform the saturation logic.

> One of my point was also not to loose any dynamic in my output signal
> compares to the dynamic of my input signal but then should i understand it
> is just not possible to save the dynamic range AND to avoid overflow?

That may be true - I'd have to examine your filter coefficients more
closely to say for sure.

But, as a simple thought experiment, consider a very simple 1-tap FIR
filter h[0] = 2.0 (i.e., a gain of two). Then if you want the output
samples to be scaled the same as the input samples, you're going to
necessarily lose dynamic range, since many samples in your input signal
that were originally in range will saturate the "filter" output.

A similar thing goes for your IIR. If, for example, your IIR has a gain
of > 1 in some band, and you input a full-scale sine wave in that band,
then it's going to clip (saturate) at the output, assuming you keep the
scaling of the output samples the same as that of the input samples.
--
%  Randy Yates                  % "I met someone who looks alot like you,
%% Fuquay-Varina, NC            %             she does the things you do,
%%% 919-577-9882                %                     but she is an IBM."
%%%% <yates@ieee.org>           %        'Yours Truly, 2095', *Time*, ELO
http://www.digitalsignallabs.com
```
```>If the accumulator result is q2.30, then to multiply by two, simply
>reinterpret the scaling as q3.29. you don't really do anything
>physically, it's just a reinterpretation. As a simple example, let's
>say the accumulator result was "1.0", or 0x40000000 in q(2,30).
>But 0x4000000 in q(3,29) is "2.0". Voila: we've multiplied by two,
>but we didn't overflow anything or lose any precision.
>
>What I would do, then, is first shift the accumulator right by 14
>(since this changes the scaling to q17.15), then perform your saturation
>logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
>bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
>round, add 0x00002000 to the accumulator before shifting right by 14.
>If you want to protect the round operation from overflow, shift the
>accumulator right 1 first, then add 0x00001000, and then shift right by
>13 and perform the saturation logic.

Let's take an example that really happen in my filter:
my accumulator value (in q2.30 before considering it as q3.29) is
0x21ca49a7: it is not over the max of q2.30, fine.
Now i do what you suggest above add 0x00002000 -> 0x21ca69a7
shift rigth by 14 -> 0x00008729 is over 0x00007fff so my q1.15 output
result is 0x7fff => ok i am saturating

>That may be true - I'd have to examine your filter coefficients more
>closely to say for sure.

the coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

>
>But, as a simple thought experiment, consider a very simple 1-tap FIR
>filter h[0] = 2.0 (i.e., a gain of two). Then if you want the output
>samples to be scaled the same as the input samples, you're going to
>necessarily lose dynamic range, since many samples in your input signal
>that were originally in range will saturate the "filter" output.
>
>A similar thing goes for your IIR. If, for example, your IIR has a gain
>of > 1 in some band, and you input a full-scale sine wave in that band,
>then it's going to clip (saturate) at the output, assuming you keep the
>scaling of the output samples the same as that of the input samples.

I agree with that idea but in my present case, my filter is a high pass
filter and it doesn't seem to me that the gain is ever > 1 (i might be
wrong on this)
that's where my un-understanding lies: how can i saturate if my filter
gain is never >1...
```
```Thanks for your help and good advices Randy.
I think i managed to work out the pb, as you said, my scaling was not
good.
I finally scaled down my inputs by 2, did all the calculation and only in
the end of the process scaled up the output by 2 again and it works: i
don't hear artifacts any more and i kept my input dynamic.
I think the problem was not really coming from having those saturated
samples at the output but rather to put them back in the feedback part of
the filter and do calculation with them.
Anyway , thanks again Randy for the time you took to answer my question
and give precious advice.What is nice with signal processing and Dsp is
that it is a never ending learning topics!
merci beaucoup ;-)

```
```Randy Yates <yates@ieee.org> writes:

> Hi marianne,
> [...]

PS: I checked your original post and found out your accumulator is 64
bits wide. That means that, while the result of multipying two 16-bit
q1.15 numbers is q2.30, the final result after being sign-extended into
the accumulator is q34.30. I was assuming in my earlier post that the
accumulator was 32 bits wide.
--
%  Randy Yates                  % "She's sweet on Wagner-I think she'd die for
Beethoven.
%% Fuquay-Varina, NC            %  She love the way Puccini lays down a tune, and
%%% 919-577-9882                %  Verdi's always creepin' from her room."
%%%% <yates@ieee.org>           % "Rockaria", *A New World Record*,
ELO
http://www.digitalsignallabs.com
```
```Randy Yates <yates@ieee.org> writes:
> [...]
> or q6.11.

Sorry! q6.10.
--
%  Randy Yates                  % "Rollin' and riding and slippin' and
%% Fuquay-Varina, NC            %  sliding, it's magic."
%%% 919-577-9882                %
%%%% <yates@ieee.org>           % 'Living' Thing', *A New World Record*, ELO
http://www.digitalsignallabs.com
```
```Hi marianne,

"marianne" <mary1331@hotmail.com> writes:

>>If the accumulator result is q2.30, then to multiply by two, simply
>>reinterpret the scaling as q3.29. you don't really do anything
>>physically, it's just a reinterpretation. As a simple example, let's
>>say the accumulator result was "1.0", or 0x40000000 in q(2,30).
>>But 0x4000000 in q(3,29) is "2.0". Voila: we've multiplied by
two,
>>but we didn't overflow anything or lose any precision.
>>
>>What I would do, then, is first shift the accumulator right by 14
>>(since this changes the scaling to q17.15), then perform your saturation
>>logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
>>bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
>>round, add 0x00002000 to the accumulator before shifting right by 14.
>>If you want to protect the round operation from overflow, shift the
>>accumulator right 1 first, then add 0x00001000, and then shift right by
>>13 and perform the saturation logic.
>
> Let's take an example that really happen in my filter:
> my accumulator value (in q2.30 before considering it as q3.29) is
> 0x21ca49a7: it is not over the max of q2.30, fine.

Right.

> Now i do what you suggest above add 0x00002000 -> 0x21ca69a7
> shift rigth by 14 -> 0x00008729 is over 0x00007fff so my q1.15 output
> result is 0x7fff => ok i am saturating

That's right. That's because you're trying to stuff a number with 2
integer digits into an output with only 0 integer digits. NB: qa.b
has a-1 integer digits.

See section 4 of my paper near the end where it gives the range of
an A(a,b) scaling. http://digitalsignallabs.com/fp.pdf

I simply showed the proper way to execute the filter given those
scalings. I assumed that you had chosen scalings that satisfy the

>>That may be true - I'd have to examine your filter coefficients more
>>closely to say for sure.
>
> the coefficients are b0 = 0.941266298 b1 = -1.8825326 b2 = 0.941266298
> a1 = -1.879089 a2 = 0.885975957
> [...]
> I agree with that idea but in my present case, my filter is a high pass
> filter and it doesn't seem to me that the gain is ever > 1 (i might be
> wrong on this)
> that's where my un-understanding lies: how can i saturate if my filter
> gain is never >1...

Because, while my example was correct, it wasn't all-inclusive. That is,
you cannot state that "An IIR filter has potential overflow problems if
and only if it has a gain greater than 1 at some frequency." The
situation is more complex: there are other reasons it could overflow and
you've run into one. Essentially, while steady-state sine waves won't
never exceeds 1) as scaled, transients will.

If you want to guarantee that your output will never overflow in this
filter, then you must scale the integer output value a in qa.b as

a = 3 + ceil(log_2(0.951266298 + 1.8825326 + 0.941266298 + 1.879089 +
0.885975957))
= 6,

or q6.11.

The system designer has to determine what is more important: avoiding
overflow at all costs or preserving precision. You can preserve
precision at the cost of potentially saturating a few special case
input signals.
--
%  Randy Yates                  % "Remember the good old 1980's, when
%% Fuquay-Varina, NC            %  things were so uncomplicated?"
%%% 919-577-9882                % 'Ticket To The Moon'
%%%% <yates@ieee.org>           % *Time*, Electric Light Orchestra
http://www.digitalsignallabs.com
```
```>I agree with that idea but in my present case, my filter is a high pass
>filter and it doesn't seem to me that the gain is ever > 1 (i might be
>wrong on this)
>that's where my un-understanding lies: how can i saturate if my filter
>gain is never >1...

You have an IIR which consists of two parts:

* FIR -> foreward path, coefficients (matlab style)
B=[0.941266298,-1.8825326, 0.941266298]
* IIR -> backward path, coefficients A=[1, -1.879089, 0.885975957]

Depending on your implementation you do foreward before feedback path (for
example direct form 1 implementation) (or vice versa).
If you own MATLAB or OCTAVE you get your filter response with:

freqz(B,A) - then you see your filterresponse -- and you are right that it
has no gain, only attenuation

but if you have a closer look to the foreward path you have to type:

freqz(B,1)
then you will see that the foreward path has positive gain

and for the feedback path type
freqz(1,A)
then you will also see positive gain (a lot!) here.

Depending to your implementation (I didn&acute;t read all the posts so I
don&acute;t
know which one you choosed) you will clip the signal in the accumulation
point if you do nothing and you are using that coefficents.

For example you have direct form 1 implementation you have to take care
that the foreward path has no gain. If the feedback path has gain it
should be only there where the foreward path compensates this.

Or you compensate that in the following way: dividing all coefficients by
two and do a shift in left direction (=multiply by two) on the correct
place in your structure (For direct form 1 this is before you split up
y[n] and the delay line for the feedback path - if you have noise problems
then you could do the feedback path in double precision).

if you are not sure what&acute;s direct form.

I hope that helps,

greetings,
Markus

www.two-pi.com

```
```>If the accumulator result is q2.30, then to multiply by two, simply
>reinterpret the scaling as q3.29. you don't really do anything
>physically, it's just a reinterpretation. As a simple example, let's
>say the accumulator result was "1.0", or 0x40000000 in q(2,30).
>But 0x4000000 in q(3,29) is "2.0". Voila: we've multiplied by two,
>but we didn't overflow anything or lose any precision.
>
>What I would do, then, is first shift the accumulator right by 14
>(since this changes the scaling to q17.15), then perform your saturation
>logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
>bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
>round, add 0x00002000 to the accumulator before shifting right by 14.
>If you want to protect the round operation from overflow, shift the
>accumulator right 1 first, then add 0x00001000, and then shift right by
>13 and perform the saturation logic.

Let's take an example that really happen in my filter:
my accumulator value (in q2.30 before considering it as q3.29) is
0x21ca49a7: it is not over the max of q2.30, fine.
Now i do what you suggest above add 0x00002000 -> 0x21ca69a7
shift rigth by 14 -> 0x00008729 is over 0x00007fff so my q1.15 output
result is 0x7fff => ok i am saturating

>That may be true - I'd have to examine your filter coefficients more
>closely to say for sure.

the coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

>
>But, as a simple thought experiment, consider a very simple 1-tap FIR
>filter h[0] = 2.0 (i.e., a gain of two). Then if you want the output
>samples to be scaled the same as the input samples, you're going to
>necessarily lose dynamic range, since many samples in your input signal
>that were originally in range will saturate the "filter" output.
>
>A similar thing goes for your IIR. If, for example, your IIR has a gain
>of > 1 in some band, and you input a full-scale sine wave in that band,
>then it's going to clip (saturate) at the output, assuming you keep the
>scaling of the output samples the same as that of the input samples.

I agree with that idea but in my present case, my filter is a high pass
filter and it doesn't seem to me that the gain is ever > 1 (i might be
wrong on this)
that's where my un-understanding lies: how can i saturate if my filter
gain is never >1...
```
```"marianne" <mary1331@hotmail.com> writes:

You're very welcome, marianne.

>>If your machine doesn't have a saturation mode, then this step is
>>absolutely pointless and potentially harmful. You never want to shift a
>>final result left (which is what multiplying by 2 does) since you risk
>>overflowing and there is no reason to do it.
>
> my machine doesn't have  a saturation mode.
> I was doing this multiplication by two of my accumulator because i read
> the following from rbj on this forum on this issue of having coeffs out of
> the range [-1 1[
>
>
"""""""""""""""""""""""""""""""""""""""""""""""""""""
> y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
>
>
> you may not scale the a1 or a2 coefficients according to any other
> criteria.
> you *may* scale the numerator coefficients (b0, b1, b2) by a common
> factor
> and, except for quantizing or overflow issues, that will change only the
> level of the output.
>
> now since a1 has a range that exceeds what you can put in a 1.15 number
> format, then what is usually done is that all 5 coefficients are
> represented
> with 1/2 of their correct value and the resulting sum is doubled
> (sometimes
> with an ASL instruction) before it is outputed and saved as a state. that
> is:
>
>
> y[n] = 2*(b0/2*x[n] + b1/2*x[n-1] + b2/2*x[n-2] - a1/2*y[n-1] -
> a2/2*y[n-2])
>
>
> so the coefs you put into the DSP are b0/2, b1/2, b2/2, a1/2, a2/2, and
> the
> result of the sum is 1/2 of what the output is until you multiply that by
> 2
> (usually with an ASL instruction). this is not solely a scaling issue
> because if you don't double the resulting sum, the y[n-1] and y[n-2]
> states
> will not be correct and your filter will not have the shape that you
> designed for.
>""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Aha!

> I do fully agree that multiplicating the accumulation result by 2 will
> cause some overflow but from what RBJ wrote i thought i had to this to
> keep the shape of my filter as i intended to be when calculating the
> coefficient s at the first place.

OK, I see now. Then you do what I mentioned previously: multiply by two
"virtually":

If the accumulator result is q2.30, then to multiply by two, simply
reinterpret the scaling as q3.29. you don't really do anything
physically, it's just a reinterpretation. As a simple example, let's
say the accumulator result was "1.0", or 0x40000000 in q(2,30).
But 0x4000000 in q(3,29) is "2.0". Voila: we've multiplied by two,
but we didn't overflow anything or lose any precision.

What I would do, then, is first shift the accumulator right by 14
(since this changes the scaling to q17.15), then perform your saturation
logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
round, add 0x00002000 to the accumulator before shifting right by 14.
If you want to protect the round operation from overflow, shift the
accumulator right 1 first, then add 0x00001000, and then shift right by
13 and perform the saturation logic.

> One of my point was also not to loose any dynamic in my output signal
> compares to the dynamic of my input signal but then should i understand it
> is just not possible to save the dynamic range AND to avoid overflow?

That may be true - I'd have to examine your filter coefficients more
closely to say for sure.

But, as a simple thought experiment, consider a very simple 1-tap FIR
filter h[0] = 2.0 (i.e., a gain of two). Then if you want the output
samples to be scaled the same as the input samples, you're going to
necessarily lose dynamic range, since many samples in your input signal
that were originally in range will saturate the "filter" output.

A similar thing goes for your IIR. If, for example, your IIR has a gain
of > 1 in some band, and you input a full-scale sine wave in that band,
then it's going to clip (saturate) at the output, assuming you keep the
scaling of the output samples the same as that of the input samples.
--
%  Randy Yates                  % "I met someone who looks alot like you,
%% Fuquay-Varina, NC            %             she does the things you do,
%%% 919-577-9882                %                     but she is an IBM."
%%%% <yates@ieee.org>           %        'Yours Truly, 2095', *Time*, ELO
http://www.digitalsignallabs.com
```
```Thanks for your answer Randy

>If your machine doesn't have a saturation mode, then this step is
>absolutely pointless and potentially harmful. You never want to shift a
>final result left (which is what multiplying by 2 does) since you risk
>overflowing and there is no reason to do it.

my machine doesn't have  a saturation mode.
I was doing this multiplication by two of my accumulator because i read
the following from rbj on this forum on this issue of having coeffs out of
the range [-1 1[

"""""""""""""""""""""""""""""""""""""""""""""""""""""
y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]

you may not scale the a1 or a2 coefficients according to any other
criteria.
you *may* scale the numerator coefficients (b0, b1, b2) by a common
factor
and, except for quantizing or overflow issues, that will change only the
level of the output.

now since a1 has a range that exceeds what you can put in a 1.15 number
format, then what is usually done is that all 5 coefficients are
represented
with 1/2 of their correct value and the resulting sum is doubled
(sometimes
with an ASL instruction) before it is outputed and saved as a state. that
is:

y[n] = 2*(b0/2*x[n] + b1/2*x[n-1] + b2/2*x[n-2] - a1/2*y[n-1] -
a2/2*y[n-2])

so the coefs you put into the DSP are b0/2, b1/2, b2/2, a1/2, a2/2, and
the
result of the sum is 1/2 of what the output is until you multiply that by
2
(usually with an ASL instruction). this is not solely a scaling issue
because if you don't double the resulting sum, the y[n-1] and y[n-2]
states
will not be correct and your filter will not have the shape that you
designed for.
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

I do fully agree that multiplicating the accumulation result by 2 will
cause some overflow but from what RBJ wrote i thought i had to this to
keep the shape of my filter as i intended to be when calculating the
coefficient s at the first place.

One of my point was also not to loose any dynamic in my output signal
compares to the dynamic of my input signal but then should i understand it
is just not possible to save the dynamic range AND to avoid overflow?

many thanks again for your help

```
```Hi marianne,

"marianne" <mary1331@hotmail.com> writes:
> [...]
> and finally i multiply the accum value by 2 (accum = accum + accum)

If your machine doesn't have a saturation mode, then this step is
absolutely pointless and potentially harmful. You never want to shift a
final result left (which is what multiplying by 2 does) since you risk
overflowing and there is no reason to do it.

Also, depending on the reason for the shift, you either a) change the
scaling of the result, or b) change the gain of the overall filter. You
almost certainly don't want to do either of these at this stage.
Additionally, if you did want to perform b) for some reason, you
can perform the function "virtually" by simply reinterpreting the
scaling without risking an overflow.

> and then compare it to the max values of my q2.30 format ie 0x3fffffff
> and 0xC0000000 to check if i have an overflow or not before rouding
> and scaling down to q1.15: ie if (accum > 0x3fffffff) then i saturate
> my 16bits result to 0x7fff if (accum < 0xC0000000) then i saturate my
> 16bits result to 0x8000 else i round it and shift it by 15 to get back
> to a q1.15

What I would do, instead, is first shift the accumulator right by 15
(since this changes the scaling to q17.15), then perform your saturation
logic using 0x00007FFF and 0xFFFF8000 and store the least-significant 16
bits of the accumulator, resulting in q(17-16).15 = q1.15. In order to
round, add 0x00004000 to the accumulator before shifting right by 15.
If you want to protect the round operation from overflow, shift the
accumulator right 1 first, then add 0x00002000, and then shift right by
14 and perform the saturation logic.

This is assuming your machine doesn't have built-in saturation
abilities. If it did, then you would do things quite a bit differently
in order to make use of them and save some cycles.
--
%  Randy Yates                  % "Midnight, on the water...
%% Fuquay-Varina, NC            %  I saw...  the ocean's daughter."
%%% 919-577-9882                % 'Can't Get It Out Of My Head'
%%%% <yates@ieee.org>           % *El Dorado*, Electric Light Orchestra
http://www.digitalsignallabs.com
```
```Sorry i did many typing errors:

my accumulation equation is
accum = b0\2*sample + b1\2*x[n-1] + b2\2*x[n-2]- a1\2*y[n-1] -
a2\2*y[n-2]

(instead of the +a2\2*y[n-2] i typed before)

and in my particular case coefficients are b0 = 0.941266298 b1 =
-1.8825326 b2 = 0.941266298 a1 = -1.879089 a2 = 0.885975957

another typing error is that i accumulate on a 32 bits register as the
q2.30 format is enough to take the temporary one bit growth that sometimes
happen during accumulation.

and finally i multiply the accum value by 2 (accum = accum + accum) and
then compare it to the max values of my q2.30 format ie 0x3fffffff and
0xC0000000 to check if i have an overflow or not before  rouding and
scaling down to q1.15: ie
if (accum > 0x3fffffff) then i saturate my 16bits result to 0x7fff
if (accum < 0xC0000000) then i saturate my 16bits result to 0x8000
else i round it and shift it by 15 to get back to a q1.15

maybe Vladirmir that's where i got it wrong? for me the accumulation was
still in the q2.30 format.
```