
simple perspective in using floats for recursive moving-average

Started by Randy Yates December 26, 2016
A colleague was explaining that when he tried using floats in a
recursive moving-average filter, there was some residual error which
wasn't present in the standard (non-recursive) moving-average filter.

Would it be correct and reasonable to frame the problem like this:

In the recursive moving-average implementation, the output y[n - 1]
needs to store perfect knowledge of the past N states, so that when you
subtract the input N - 1 samples back from y[n - 1], you perfectly
maintain the sum of inputs x[n - N + 2] to x[n - 1]. However, it does
not. 

In general, every addition/subtraction to y[n - m], m < N, produces a
quantization error; this quantization error is propagated into the new
output y[n].

So it is the fact that the output "state" of the filter (i.e., the
output y[n - 1]) is not perfect, i.e., the state of the filter is
tainted, that creates this residual error.

This also illustrates why a fixed-point implementation (using the
appropriate intermediate accumulator width) of the same filter  
works: the output state is known perfectly.

Correct? Good way to view it? (Trivial???)
-- 
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
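
A minimal C sketch of the comparison being described (hypothetical code, not the colleague's firmware; the 12-bit ADC, N = 100, and the divide-by-100 detail come up later in the thread): run the recursive update y[n] = y[n-1] + x[n] - x[n-N] alongside a direct re-sum of the same window and look at the residual.

/* Hypothetical sketch (not the colleague's code): recursive moving sum vs.
 * a direct re-sum of the same window, in single precision.  The inputs are
 * 12-bit-ADC-like values scaled by 1/100, so they carry repeating binary
 * fractions and each add/subtract has a chance to round. */
#include <stdio.h>
#include <stdlib.h>

#define N      100      /* window length (assumed) */
#define NSAMP  100000L  /* samples to run */

int main(void)
{
    static float delay[N];        /* last N inputs */
    float recursive = 0.0f;       /* running sum, updated recursively */
    int idx = 0;

    for (long n = 0; n < NSAMP; n++) {
        float x = (float)(rand() % 4096) / 100.0f;

        recursive += x - delay[idx];   /* add newest, subtract oldest */
        delay[idx] = x;
        idx = (idx + 1) % N;
    }

    /* Re-sum the window that the recursive sum is supposed to equal. */
    float direct = 0.0f;
    for (int k = 0; k < N; k++)
        direct += delay[k];

    printf("recursive = %f, direct = %f, residual = %g\n",
           recursive, direct, recursive - direct);
    return 0;
}

With the raw integer codes (no /100.0f) the residual stays at zero, since every add and subtract is exact; it's the fractional scaling that lets the error creep in.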
On Sunday, December 25, 2016 at 11:28:00 PM UTC-5, Randy Yates wrote:
> A colleague was explaining that when he tried using floats in a
> recursive moving-average filter, there was some residual error which
> wasn't present in the standard (non-recursive) moving-average filter.
let's call this a "moving-sum filter" and separate this from scaling by 1/N on either the input or output.
> Would it be correct and reasonable to frame the problem like this:
>
> In the recursive moving-average implementation, the output y[n - 1]
> needs to store perfect knowledge of the past N states,
yes, you do need perfect knowledge of the N states for either the recursive or non-recursive implementation. what the recursive implementation also needs is *perfect* knowledge of the sum of the last N states.

when a sample x[n] is added to the accumulator at the current sample time, one needs to ensure that exactly the same value will be subtracted N samples later. we can store and retrieve that exact value, but in floating point we are not guaranteed that exactly the value that was added will be what is subtracted, because of the different states of the accumulator at the different times. addition and subtraction in floating point normally have rounding error when one quantity is much larger (with a different exponent) than the other quantity.

but in fixed point, with an accumulator that is a little wider (by log2(N) bits) on the left so that the sum of N numbers is guaranteed exact, you are absolutely certain that what was added at one time is exactly subtracted at the later time.

this accumulator is an integrator with a pole directly on the unit circle at z = 1, so it is "critically stable" or "marginally stable". you can have an "almost" rectangular window with a moving-sum or moving-average if you make your integrator pole go slightly inside the unit circle, like 0.999999. if you do that, the turds left by roundoff error get smaller and smaller. but if the pole is at 1.0, then the turds left by a subtraction being ever-so-slightly different than what was originally added will never go away.

merry Christmas Randy.

r b-j
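
To make the fixed-point half of that concrete, here's a small C sketch (my own illustration, assuming the 12-bit ADC and N = 100 that come up later in the thread): a 12-bit sample summed 100 times needs 12 + ceil(log2(100)) = 19 bits, so a 32-bit accumulator never rounds and the value subtracted N samples later is bit-for-bit the value that was added.

#include <stdint.h>

#define N 100

static uint16_t delay[N];   /* last N raw ADC codes */
static int32_t  acc;        /* wide accumulator: the moving sum, held exactly */
static int      idx;

/* Push one 12-bit ADC sample and return the moving average.  The integer
 * add/subtract never rounds; only the final scaling by 1/N rounds, and that
 * rounding never feeds back into the state. */
float moving_average_fixed(uint16_t adc)
{
    acc += (int32_t)adc - (int32_t)delay[idx];
    delay[idx] = adc;
    idx = (idx + 1) % N;
    return (float)acc / (float)N;
}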
On Sun, 25 Dec 2016 23:27:52 -0500, Randy Yates wrote:

> A colleague was explaining that when he tried using floats in a
> recursive moving-average filter, there was some residual error which
> wasn't present in the standard (non-recursive) moving-average filter.
> [...]
> In general, every addition/subtraction to y[n - m], m < N, produces a
> quantization error; this quantization error is propagated into the new
> output y[n].
> [...]
> This also illustrates why a fixed-point implementation (using the
> appropriate intermediate accumulator width) of the same filter works:
> the output state is known perfectly.
>
> Correct? Good way to view it? (Trivial???)
Nearly exactly correct -- not _every_ addition/subtraction to the
accumulator will cause a rounding error: the significant bits of one or
the other addend have to be rounded off.

I would expect that if you're using unadulterated ADC inputs and
double-precision floating point you'd have to have a Really Long moving
average to have problems.

However, if you start with data that's not really fixed point, or if you
take fixed-point data and multiply it by some floating-point value that
has non-zero bits all the way down to the least significant bits, it's a
different story.

(That's not clear. It's true, and it's clear to me, but I can't get the
picture in my head into text. Sorry.)

You are correct that a (properly done) fixed-point implementation will be
bit-exact.

I don't think it's trivial, but if you're going to explain it to someone
who doesn't know how floats work you should probably show how truncation
happens.

--
Tim Wescott
Control systems, embedded software and circuit design
I'm looking for work! See my website if you're interested
http://www.wescottdesign.com
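
A tiny single-precision illustration of Tim's first point above (my own toy numbers, nothing from the thread): whether an add rounds depends on whether the smaller addend's bits survive alignment against the larger one's exponent.

#include <stdio.h>

int main(void)
{
    float a = 0.5f;                 /* exactly representable */

    /* 2^20 + 0.5 needs 22 significant bits -- that fits in float's 24-bit
     * significand, so the add is exact and a comes back unchanged. */
    float big = 1048576.0f;         /* 2^20 */
    float s = big + a;
    printf("exact case:   %.10f\n", s - big);

    /* 2^25 + 0.5 would need 27 bits; the 0.5 falls below the last
     * significand bit kept and is rounded away, so nothing comes back. */
    float huge = 33554432.0f;       /* 2^25 */
    float t = huge + a;
    printf("rounded case: %.10f\n", t - huge);

    return 0;
}

The exact case prints 0.5 and the rounded case prints 0; in the moving-sum filter the accumulator plays the role of big/huge and the incoming sample plays the role of a.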
robert bristow-johnson <rbj@audioimagination.com> writes:

> [...]
> you can have an "almost" rectangular window with a moving-sum or
> moving-average if you make your integrator pole go slightly inside the
> unit circle, like 0.999999.
Leaky, leaky...
> if you do that, the turds left by roundoff error get smaller and
> smaller. but if the pole is at 1.0, then the turds left by a
> subtraction being ever-so-slightly different than what was originally
> added will never go away.
Well that's a bunch of sh...
> merry Christmas Randy.
You too, Robert.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
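
For completeness, a sketch of the leaky variant robert is describing (my own formulation, so the exact placement of the leak is an assumption): apply the comb x[n] - x[n-N] as before, but integrate with a pole at p = 0.999999 instead of exactly 1, so any roundoff injected into the accumulator decays as p^n instead of persisting forever.

#define N 100

static float delay[N];          /* last N inputs */
static float acc;               /* leaky accumulator */
static int   idx;

/* "Almost" rectangular moving sum: same comb, but the integrator pole is
 * at 0.999999 rather than 1.0, so old roundoff turds fade away. */
float leaky_moving_sum(float x)
{
    const float p = 0.999999f;

    acc = p * acc + (x - delay[idx]);
    delay[idx] = x;
    idx = (idx + 1) % N;
    return acc;
}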
Tim Wescott <tim@seemywebsite.com> writes:

> On Sun, 25 Dec 2016 23:27:52 -0500, Randy Yates wrote:
> [...]
> Nearly exactly correct -- not _every_ addition/subtraction to the
> accumulator will cause a rounding error: the significant bits of one or
> the other addend have to be rounded off.
Hi Tim,

I agree I could have been more precise on that point. But thanks for
confirmation of the basic idea.
> I would expect that if you're using unadulterated ADC inputs and
> double-precision floating point you'd have to have a Really Long moving
> average to have problems.
I haven't seen the code but one thing that they may have been doing is
dividing each (12-bit) ADC input by N (N = 100) before summing. So they
were adulterated. (Oh my!)

Also this is single-precision floating point, not double.
> However, if you start with data that's not really fixed point, or if you
> take fixed-point data and multiply it by some floating-point value that
> has non-zero bits all the way down to the least significant bits, it's a
> different story.
>
> (That's not clear. It's true, and it's clear to me, but I can't get the
> picture in my head into text. Sorry.)
No no, I see what you're saying (I think...).
> You are correct that a (properly done) fixed-point implementation will
> be bit-exact.
> I don't think it's trivial, but if you're going to explain it to someone
> who doesn't know how floats work you should probably show how truncation
> happens.
Yup. And admittedly I don't understand floating-point truncation.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
On Mon, 26 Dec 2016 17:03:55 -0500, Randy Yates wrote:

> [...]
> I haven't seen the code but one thing that they may have been doing is
> dividing each (12-bit) ADC input by N (N = 100) before summing. So they
> were adulterated. (Oh my!)
>
> Also this is single-precision floating point, not double.
> [...]
> Yup. And admittedly I don't understand floating-point truncation.
I really wish that I had some semblance of cartoonist chops, or that one
of my boys had stuck with it (all of the folks that I know of who are
cartoonists were obsessive drawers in their youth -- we don't have that
in my family). If I had such at my disposal, there'd be a video about
floating point, with a clip explaining truncation featuring a guy with an
axe, and a row of ones and zeros.

At any rate, single-precision just makes it worse. Division by 100 in
binary results in a repeating radix (radix ~ "decimal" here) -- so
there's non-zero bits all the way down into the mud, which means that
there are opportunities for truncation any time the exponent of the
accumulator is different from the exponent of the "adulterated" ADC
reading.

On the other hand, if they were to do the moving-sum filter without
touching the ADC value (or after dividing it by factors of 2^n), there's
enough precision in a single-precision floating point to have lossless
adds and subtracts. Just talking them into putting the divide by 100
_after_ the moving-sum instead of before would probably fix the problem
in this case. If you can't talk them into that, having them do a moving
average over 128 samples instead of 100 would probably work, too, unless
there's magic to the 100 samples beyond it being a nice round number for
people born with ten fingers.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
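
A C sketch of that suggestion (hypothetical code standing in for the firmware nobody here has seen; it assumes N = 100, 12-bit codes, single precision): the same recursive moving sum run both ways, scaling by 1/100 before the sum versus once at the output.

#include <stdio.h>
#include <stdlib.h>

#define N 100

int main(void)
{
    static float delay_pre[N], delay_post[N];
    float sum_pre = 0.0f;    /* sum of samples already divided by 100 */
    float sum_post = 0.0f;   /* sum of raw integer-valued samples */
    int idx = 0;

    for (long n = 0; n < 1000000L; n++) {
        float adc = (float)(rand() % 4096);   /* raw 12-bit code, exact in float */

        /* divide-before: adc/100 is a repeating binary fraction, so the
         * add and the later subtract are not guaranteed to cancel exactly */
        float xs = adc / 100.0f;
        sum_pre += xs - delay_pre[idx];
        delay_pre[idx] = xs;

        /* divide-after: 100 values below 4096 sum to less than 2^19, well
         * inside float's 24-bit significand, so every operation is exact */
        sum_post += adc - delay_post[idx];
        delay_post[idx] = adc;

        idx = (idx + 1) % N;
    }

    printf("divide-before average: %.6f\n", sum_pre);
    printf("divide-after  average: %.6f\n", sum_post / 100.0f);
    return 0;
}

The divide-after result is the exact window sum (up to the single final division); the divide-before result carries whatever rounding has accumulated. Tim's 128-sample suggestion works for the same reason: dividing by 2^7 only changes the exponent, so nothing gets rounded before the sum.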
On Mon, 26 Dec 2016 17:03:55 -0500, Randy Yates wrote:

> Yup. And admittedly I don't understand floating-point truncation.
I just tried for a nice analogy, and failed. But imagine that you want
to add 1/3 to 4/3, in a floating point notation that only allows an
8-bit mantissa. 1/3 = 1r0101010 x 2^-2. 4/3 = 1r0101010 x 2^0. Note that
truncation has already happened, because 1/3 is a repeating "decimal" in
binary -- ignore that, and just go with the above values. To add them,
you first line them up, then add:

   r010101010
+ 1r0101010
-------------
  1r101010010

This answer is exact in the stored numbers, but it is 10 bits. So, you
whack off the two trailing bits:

answer = 1r1010100 x 2^0

This answer is off by (if I'm counting digits correctly) 2^-8. Every
time you do an operation, the answer is off.

It gets more complicated in the real world, but this is a summation of
the effect.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
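
For anyone who wants to poke at it, here's a little C program that mechanizes the same toy example (the chop() helper is mine, chopping to an 8-bit significand with frexp/ldexp rather than modelling a real FPU):

#include <stdio.h>
#include <math.h>

/* Truncate x toward zero to `bits` significand bits. */
static double chop(double x, int bits)
{
    int e;
    double m;

    if (x == 0.0)
        return 0.0;
    m = frexp(x, &e);                        /* x = m * 2^e, 0.5 <= |m| < 1 */
    m = trunc(m * ldexp(1.0, bits)) * ldexp(1.0, -bits);
    return ldexp(m, e);
}

int main(void)
{
    double third      = chop(1.0 / 3.0, 8);  /* 1r0101010 x 2^-2 = 0.33203125 */
    double four_third = chop(4.0 / 3.0, 8);  /* 1r0101010 x 2^0  = 1.328125   */

    double exact  = third + four_third;      /* the 10-bit sum, held exactly  */
    double stored = chop(exact, 8);          /* what the 8-bit mantissa keeps */

    printf("exact sum  = %.10f\n", exact);
    printf("stored sum = %.10f\n", stored);
    printf("error      = %.10f (2^-8 = %.10f)\n", exact - stored, ldexp(1.0, -8));
    return 0;
}

It prints an error of exactly 2^-8, matching the count above.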
Tim Wescott <seemywebsite@myfooter.really> writes:

> [...]
> At any rate, single-precision just makes it worse. Division by 100 in
> binary results in a repeating radix (radix ~ "decimal" here) -- so
> there's non-zero bits all the way down into the mud, which means that
> there are opportunities for truncation any time the exponent of the
> accumulator is different from the exponent of the "adulterated" ADC
> reading.
Aha. So the key (it seems) is that it (possibly) occurs when the exponents
of the operands are not the same. From there I guess we're just back to
standard textbook truncation at some number of bits? The point of
truncation is probably some function of the exponent difference?
> On the other hand, if they were to do the moving-sum filter without
> touching the ADC value (or after dividing it by factors of 2^n), there's
> enough precision in a single-precision floating point to have lossless
> adds and subtracts. Just talking them into putting the divide by 100
> _after_ the moving-sum instead of before would probably fix the problem
> in this case. If you can't talk them into that, having them do a moving
> average over 128 samples instead of 100 would probably work, too, unless
> there's magic to the 100 samples beyond it being a nice round number for
> people born with ten fingers.
I too thought that if they would just maintain the pre-division sum this
problem would go away.

However, and I don't mean to dismiss your and Robert's wonderful efforts
at bringing light to the issue, they don't really need a fix... The
sample rate is only about 3 samples/second, and even on a piddly 20 MHz
Coldfire V1 there is still plenty of horsepower available, so this
doesn't make much of a dent. All this was just to prop up my own
understanding of what the heck was going on.

Thanks Tim.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
Randy Yates <yates@digitalsignallabs.com> writes:
> [...]
> However, and I don't mean to dismiss your and Robert's wonderful efforts
> at bringing light to the issue, they don't really need a fix... The
> sample rate is only about 3 samples/second, and even on a piddly 20 MHz
> Coldfire V1 there is still plenty of horsepower available, so this
> doesn't make much of a dent.
What I meant here is that they just do the full sum each sample (and thus
avoid the truncation/DC build-up error). They don't really care that it's
wasteful since the computational load is still very small.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
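
In code terms, something like this (a sketch of my understanding of what they do, not their actual source): re-sum the whole window every sample, so each output's rounding error is bounded by one pass over the window and nothing carries over to the next output.

#define N 100

/* Brute-force moving average: no recursion, no state carried between
 * outputs, so roundoff never accumulates into a DC offset. */
float full_sum_average(const float window[N])
{
    float sum = 0.0f;

    for (int k = 0; k < N; k++)
        sum += window[k];
    return sum / (float)N;
}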
Tim Wescott <seemywebsite@myfooter.really> writes:

> [...]
> This answer is off by (if I'm counting digits correctly) 2^-8. Every
> time you do an operation, the answer is off.
>
> It gets more complicated in the real world, but this is a summation of
> the effect.
Right. Pretty straightforward. Although I wonder what complicates this?

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com