
simple perspective in using floats for recursive moving-average

Started by Randy Yates December 26, 2016
A colleague was explaining that when he tried using floats in a
recursive moving-average filter, there was some residual error which
wasn't present in the standard (non-recursive) moving-average filter.

Would it be correct and reasonable to frame the problem like this:

In the recursive moving-average implementation, the output y[n - 1]
needs to store perfect knowledge of the past N states, so that when you
subtract the input N - 1 samples back from y[n - 1], you perfectly
maintain the sum of inputs x[n - N + 2] to x[n - 1]. However, it does
not. 

In general, every addition/subtraction to y[n - m], m < N, produces a
quantization error; this quantization error is propagated into the new
output y[n].

So it is the fact that the output "state" of the filter (i.e., the
output y[n - 1]) is not perfect, i.e., the state of the filter is
tainted, that creates this residual error.

This also illustrates why a fixed-point implementation (using the
appropriate intermediate accumulator width) of the same filter  
works: the output state is known perfectly.

Correct? Good way to view it? (Trivial???)
-- 
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
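
A minimal C sketch of the comparison being described (hypothetical code, not the colleague's firmware; the 12-bit ADC, N = 100, and the divide-by-100 detail come up later in the thread): run the recursive update y[n] = y[n-1] + x[n] - x[n-N] alongside a direct re-sum of the same window and look at the residual.

/* Hypothetical sketch (not the colleague's code): recursive moving sum vs.
 * a direct re-sum of the same window, in single precision.  The inputs are
 * 12-bit-ADC-like values scaled by 1/100, so they carry repeating binary
 * fractions and each add/subtract has a chance to round. */
#include <stdio.h>
#include <stdlib.h>

#define N      100      /* window length (assumed) */
#define NSAMP  100000L  /* samples to run */

int main(void)
{
    static float delay[N];        /* last N inputs */
    float recursive = 0.0f;       /* running sum, updated recursively */
    int idx = 0;

    for (long n = 0; n < NSAMP; n++) {
        float x = (float)(rand() % 4096) / 100.0f;

        recursive += x - delay[idx];   /* add newest, subtract oldest */
        delay[idx] = x;
        idx = (idx + 1) % N;
    }

    /* Re-sum the window that the recursive sum is supposed to equal. */
    float direct = 0.0f;
    for (int k = 0; k < N; k++)
        direct += delay[k];

    printf("recursive = %f, direct = %f, residual = %g\n",
           recursive, direct, recursive - direct);
    return 0;
}

With the raw integer codes (no /100.0f) the residual stays at zero, since every add and subtract is exact; it's the fractional scaling that lets the error creep in.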
On Sunday, December 25, 2016 at 11:28:00 PM UTC-5, Randy Yates wrote:
> A colleague was explaining that when he tried using floats in a
> recursive moving-average filter, there was some residual error which
> wasn't present in the standard (non-recursive) moving-average filter.
let's call this a "moving-sum filter" and separate this from scaling by 1/N on either the input or output.
> Would it be correct and reasonable to frame the problem like this:
>
> In the recursive moving-average implementation, the output y[n - 1]
> needs to store perfect knowledge of the past N states,
yes, you do need perfect knowledge of the N states for either the recursive or non-recursive implementation. what the recursive implementation also needs is *perfect* knowledge of the sum of the last N states.

when a sample x[n] is added to the accumulator at the current sample time, one needs to ensure that exactly the same value will be subtracted N samples later. we can store and retrieve that exact value, but in floating point we are not guaranteed that exactly the value that was added will be what is subtracted, because of the different states of the accumulator at the different times. addition and subtraction in floating point normally have rounding error when one quantity is much larger (with a different exponent) than the other quantity.

but in fixed point, with an accumulator that is a little wider (by log2(N) bits) on the left so that the sum of N numbers is guaranteed exact, you are absolutely certain that what was added at one time is exactly subtracted at the later time.

this accumulator is an integrator with a pole directly on the unit circle at z = 1, so it is "critically stable" or "marginally stable". you can have an "almost" rectangular window with a moving-sum or moving-average if you make your integrator pole go slightly inside the unit circle, like 0.999999. if you do that, the turds left by roundoff error get smaller and smaller. but if the pole is at 1.0, then the turds left by a subtraction being ever-so-slightly different than what was originally added will never go away.

merry Christmas Randy.

r b-j
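
To make the fixed-point half of that concrete, here's a small C sketch (my own illustration, assuming the 12-bit ADC and N = 100 that come up later in the thread): a 12-bit sample summed 100 times needs 12 + ceil(log2(100)) = 19 bits, so a 32-bit accumulator never rounds and the value subtracted N samples later is bit-for-bit the value that was added.

#include <stdint.h>

#define N 100

static uint16_t delay[N];   /* last N raw ADC codes */
static int32_t  acc;        /* wide accumulator: the moving sum, held exactly */
static int      idx;

/* Push one 12-bit ADC sample and return the moving average.  The integer
 * add/subtract never rounds; only the final scaling by 1/N rounds, and that
 * rounding never feeds back into the state. */
float moving_average_fixed(uint16_t adc)
{
    acc += (int32_t)adc - (int32_t)delay[idx];
    delay[idx] = adc;
    idx = (idx + 1) % N;
    return (float)acc / (float)N;
}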
On Sun, 25 Dec 2016 23:27:52 -0500, Randy Yates wrote:

> A colleague was explaining that when he tried using floats in a
> recursive moving-average filter, there was some residual error which
> wasn't present in the standard (non-recursive) moving-average filter.
> [...]
> In general, every addition/subtraction to y[n - m], m < N, produces a
> quantization error; this quantization error is propagated into the new
> output y[n].
> [...]
> This also illustrates why a fixed-point implementation (using the
> appropriate intermediate accumulator width) of the same filter works:
> the output state is known perfectly.
>
> Correct? Good way to view it? (Trivial???)
Nearly exactly correct -- not _every_ addition/subtraction to the
accumulator will cause a rounding error: the significant bits of one or
the other addend have to be rounded off.

I would expect that if you're using unadulterated ADC inputs and
double-precision floating point you'd have to have a Really Long moving
average to have problems.

However, if you start with data that's not really fixed point, or if you
take fixed-point data and multiply it by some floating-point value that
has non-zero bits all the way down to the least significant bits, it's a
different story.

(That's not clear. It's true, and it's clear to me, but I can't get the
picture in my head into text. Sorry.)

You are correct that a (properly done) fixed-point implementation will be
bit-exact.

I don't think it's trivial, but if you're going to explain it to someone
who doesn't know how floats work you should probably show how truncation
happens.

--
Tim Wescott
Control systems, embedded software and circuit design
I'm looking for work! See my website if you're interested
http://www.wescottdesign.com
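
A tiny single-precision illustration of Tim's first point above (my own toy numbers, nothing from the thread): whether an add rounds depends on whether the smaller addend's bits survive alignment against the larger one's exponent.

#include <stdio.h>

int main(void)
{
    float a = 0.5f;                 /* exactly representable */

    /* 2^20 + 0.5 needs 22 significant bits -- that fits in float's 24-bit
     * significand, so the add is exact and a comes back unchanged. */
    float big = 1048576.0f;         /* 2^20 */
    float s = big + a;
    printf("exact case:   %.10f\n", s - big);

    /* 2^25 + 0.5 would need 27 bits; the 0.5 falls below the last
     * significand bit kept and is rounded away, so nothing comes back. */
    float huge = 33554432.0f;       /* 2^25 */
    float t = huge + a;
    printf("rounded case: %.10f\n", t - huge);

    return 0;
}

The exact case prints 0.5 and the rounded case prints 0; in the moving-sum filter the accumulator plays the role of big/huge and the incoming sample plays the role of a.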
robert bristow-johnson <rbj@audioimagination.com> writes:

> [...]
> you can have an "almost" rectangular window with a moving-sum or
> moving-average if you make your integrator pole go slightly inside the
> unit circle, like 0.999999.
Leaky, leaky...
> if you do that, the turds left by roundoff error get smaller and
> smaller. but if the pole is at 1.0, then the turds left by a
> subtraction being ever-so-slightly different than what was originally
> added will never go away.
Well that's a bunch of sh...
> merry Christmas Randy.
You too, Robert.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
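
For completeness, a sketch of the leaky variant robert is describing (my own formulation, so the exact placement of the leak is an assumption): apply the comb x[n] - x[n-N] as before, but integrate with a pole at p = 0.999999 instead of exactly 1, so any roundoff injected into the accumulator decays as p^n instead of persisting forever.

#define N 100

static float delay[N];          /* last N inputs */
static float acc;               /* leaky accumulator */
static int   idx;

/* "Almost" rectangular moving sum: same comb, but the integrator pole is
 * at 0.999999 rather than 1.0, so old roundoff turds fade away. */
float leaky_moving_sum(float x)
{
    const float p = 0.999999f;

    acc = p * acc + (x - delay[idx]);
    delay[idx] = x;
    idx = (idx + 1) % N;
    return acc;
}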
Tim Wescott <tim@seemywebsite.com> writes:

> On Sun, 25 Dec 2016 23:27:52 -0500, Randy Yates wrote:
> [...]
> Nearly exactly correct -- not _every_ addition/subtraction to the
> accumulator will cause a rounding error: the significant bits of one or
> the other addend have to be rounded off.
Hi Tim,

I agree I could have been more precise on that point. But thanks for
confirmation of the basic idea.
> I would expect that if you're using unadulterated ADC inputs and
> double-precision floating point you'd have to have a Really Long moving
> average to have problems.
I haven't seen the code but one thing that they may have been doing is
dividing each (12-bit) ADC input by N (N = 100) before summing. So they
were adulterated. (Oh my!)

Also this is single-precision floating point, not double.
> However, if you start with data that's not really fixed point, or if you
> take fixed-point data and multiply it by some floating-point value that
> has non-zero bits all the way down to the least significant bits, it's a
> different story.
>
> (That's not clear. It's true, and it's clear to me, but I can't get the
> picture in my head into text. Sorry.)
No no, I see what you're saying (I think...).
> You are correct that a (properly done) fixed-point implementation will
> be bit-exact.
> I don't think it's trivial, but if you're going to explain it to someone
> who doesn't know how floats work you should probably show how truncation
> happens.
Yup. And admittedly I don't understand floating-point truncation.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
On Mon, 26 Dec 2016 17:03:55 -0500, Randy Yates wrote:

> [...]
> I haven't seen the code but one thing that they may have been doing is
> dividing each (12-bit) ADC input by N (N = 100) before summing. So they
> were adulterated. (Oh my!)
>
> Also this is single-precision floating point, not double.
> [...]
> Yup. And admittedly I don't understand floating-point truncation.
I really wish that I had some semblance of cartoonist chops, or that one
of my boys had stuck with it (all of the folks that I know of who are
cartoonists were obsessive drawers in their youth -- we don't have that
in my family). If I had such at my disposal, there'd be a video about
floating point, with a clip explaining truncation featuring a guy with an
axe, and a row of ones and zeros.

At any rate, single-precision just makes it worse. Division by 100 in
binary results in a repeating radix (radix ~ "decimal" here) -- so
there's non-zero bits all the way down into the mud, which means that
there are opportunities for truncation any time the exponent of the
accumulator is different from the exponent of the "adulterated" ADC
reading.

On the other hand, if they were to do the moving-sum filter without
touching the ADC value (or after dividing it by factors of 2^n), there's
enough precision in a single-precision floating point to have lossless
adds and subtracts. Just talking them into putting the divide by 100
_after_ the moving-sum instead of before would probably fix the problem
in this case. If you can't talk them into that, having them do a moving
average over 128 samples instead of 100 would probably work, too, unless
there's magic to the 100 samples beyond it being a nice round number for
people born with ten fingers.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
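
A C sketch of that suggestion (hypothetical code standing in for the firmware nobody here has seen; it assumes N = 100, 12-bit codes, single precision): the same recursive moving sum run both ways, scaling by 1/100 before the sum versus once at the output.

#include <stdio.h>
#include <stdlib.h>

#define N 100

int main(void)
{
    static float delay_pre[N], delay_post[N];
    float sum_pre = 0.0f;    /* sum of samples already divided by 100 */
    float sum_post = 0.0f;   /* sum of raw integer-valued samples */
    int idx = 0;

    for (long n = 0; n < 1000000L; n++) {
        float adc = (float)(rand() % 4096);   /* raw 12-bit code, exact in float */

        /* divide-before: adc/100 is a repeating binary fraction, so the
         * add and the later subtract are not guaranteed to cancel exactly */
        float xs = adc / 100.0f;
        sum_pre += xs - delay_pre[idx];
        delay_pre[idx] = xs;

        /* divide-after: 100 values below 4096 sum to less than 2^19, well
         * inside float's 24-bit significand, so every operation is exact */
        sum_post += adc - delay_post[idx];
        delay_post[idx] = adc;

        idx = (idx + 1) % N;
    }

    printf("divide-before average: %.6f\n", sum_pre);
    printf("divide-after  average: %.6f\n", sum_post / 100.0f);
    return 0;
}

The divide-after result is the exact window sum (up to the single final division); the divide-before result carries whatever rounding has accumulated. Tim's 128-sample suggestion works for the same reason: dividing by 2^7 only changes the exponent, so nothing gets rounded before the sum.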
On Mon, 26 Dec 2016 17:03:55 -0500, Randy Yates wrote:

> Yup. And admittedly I don't understand floating-point truncation.
I just tried for a nice analogy, and failed. But imagine that you want
to add 1/3 to 4/3, in a floating point notation that only allows an
8-bit mantissa. 1/3 = 1r0101010 x 2^-2. 4/3 = 1r0101010 x 2^0. Note that
truncation has already happened, because 1/3 is a repeating "decimal" in
binary -- ignore that, and just go with the above values. To add them,
you first line them up, then add:

   r010101010
+ 1r0101010
-------------
  1r101010010

This answer is exact in the stored numbers, but it is 10 bits. So, you
whack off the two trailing bits:

answer = 1r1010100 x 2^0

This answer is off by (if I'm counting digits correctly) 2^-8. Every
time you do an operation, the answer is off.

It gets more complicated in the real world, but this is a summation of
the effect.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com
I'm looking for work -- see my website!
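
For anyone who wants to poke at it, here's a little C program that mechanizes the same toy example (the chop() helper is mine, chopping to an 8-bit significand with frexp/ldexp rather than modelling a real FPU):

#include <stdio.h>
#include <math.h>

/* Truncate x toward zero to `bits` significand bits. */
static double chop(double x, int bits)
{
    int e;
    double m;

    if (x == 0.0)
        return 0.0;
    m = frexp(x, &e);                        /* x = m * 2^e, 0.5 <= |m| < 1 */
    m = trunc(m * ldexp(1.0, bits)) * ldexp(1.0, -bits);
    return ldexp(m, e);
}

int main(void)
{
    double third      = chop(1.0 / 3.0, 8);  /* 1r0101010 x 2^-2 = 0.33203125 */
    double four_third = chop(4.0 / 3.0, 8);  /* 1r0101010 x 2^0  = 1.328125   */

    double exact  = third + four_third;      /* the 10-bit sum, held exactly  */
    double stored = chop(exact, 8);          /* what the 8-bit mantissa keeps */

    printf("exact sum  = %.10f\n", exact);
    printf("stored sum = %.10f\n", stored);
    printf("error      = %.10f (2^-8 = %.10f)\n", exact - stored, ldexp(1.0, -8));
    return 0;
}

It prints an error of exactly 2^-8, matching the count above.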
Tim Wescott <seemywebsite@myfooter.really> writes:

> [...]
> At any rate, single-precision just makes it worse. Division by 100 in
> binary results in a repeating radix (radix ~ "decimal" here) -- so
> there's non-zero bits all the way down into the mud, which means that
> there are opportunities for truncation any time the exponent of the
> accumulator is different from the exponent of the "adulterated" ADC
> reading.
Aha. So the key (it seems) is that it (possibly) occurs when the exponents
of the operands are not the same. From there I guess we're just back to
standard textbook truncation at some number of bits? The point of
truncation is probably some function of the exponent difference?
> On the other hand, if they were to do the moving-sum filter without
> touching the ADC value (or after dividing it by factors of 2^n), there's
> enough precision in a single-precision floating point to have lossless
> adds and subtracts. Just talking them into putting the divide by 100
> _after_ the moving-sum instead of before would probably fix the problem
> in this case. If you can't talk them into that, having them do a moving
> average over 128 samples instead of 100 would probably work, too, unless
> there's magic to the 100 samples beyond it being a nice round number for
> people born with ten fingers.
I too thought that if they would just maintain the pre-division sum this
problem would go away.

However, and I don't mean to dismiss your and Robert's wonderful efforts
at bringing light to the issue, they don't really need a fix... The
sample rate is only about 3 samples/second, and even on a piddly 20 MHz
Coldfire V1 there is still plenty of horsepower available, so this
doesn't make much of a dent. All this was just to prop up my own
understanding of what the heck was going on.

Thanks Tim.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
Randy Yates <yates@digitalsignallabs.com> writes:
> [...]
> However, and I don't mean to dismiss your and Robert's wonderful efforts
> at bringing light to the issue, they don't really need a fix... The
> sample rate is only about 3 samples/second, and even on a piddly 20 MHz
> Coldfire V1 there is still plenty of horsepower available, so this
> doesn't make much of a dent.
What I meant here is that they just do the full sum each sample (and thus
avoid the truncation/DC build-up error). They don't really care that it's
wasteful since the computational load is still very small.

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com
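
In code terms, something like this (a sketch of my understanding of what they do, not their actual source): re-sum the whole window every sample, so each output's rounding error is bounded by one pass over the window and nothing carries over to the next output.

#define N 100

/* Brute-force moving average: no recursion, no state carried between
 * outputs, so roundoff never accumulates into a DC offset. */
float full_sum_average(const float window[N])
{
    float sum = 0.0f;

    for (int k = 0; k < N; k++)
        sum += window[k];
    return sum / (float)N;
}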
Tim Wescott <seemywebsite@myfooter.really> writes:

> [...]
> This answer is off by (if I'm counting digits correctly) 2^-8. Every
> time you do an operation, the answer is off.
>
> It gets more complicated in the real world, but this is a summation of
> the effect.
Right. Pretty straightforward. Although I wonder what complicates this?

--
Randy Yates, DSP/Embedded Firmware Developer
Digital Signal Labs
http://www.digitalsignallabs.com