Reply by robert bristow-johnson September 7, 2015
On 9/7/15 4:04 PM, Steve Pope wrote:
> robert bristow-johnson<rbj@audioimagination.com> wrote:
>
>> On 9/5/15 2:33 PM, Steve Pope wrote:
>
>>> As I have stated, a given floating point format/spec guarantees that
>>> you obtain exact integer arithmetic over a specified range.
>
>>>> even though the floating-point number coming out of the delay line is
>>>> exactly the same as what went in, the number it's getting added to is
>>>> not the same.
>
>>> Sure it is,
>
>> no, the state of the accumulator will not necessarily be the same when
>> x[n] is subtracted as it was when x[n] was added D samples before.
>
>> but i now agree with you that you can do fixed-point equivalents with
>> floats and solve this roundoff turd problem.
>
> I feel we are going around in circles, much like a CIC filter.
>
> Maybe a general, "CIC filters won't work if you make an error in
> the arithmetic" would suffice.
naw, it's not specific enough. the "error in arithmetic" here is *not*
quantizing the input to the CIC filter. admittedly, we don't normally
introduce another, otherwise unnecessary, quantizer into the signal
chain to fix errors. but this time we do.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
Reply by Steve Pope September 7, 2015
robert bristow-johnson  <rbj@audioimagination.com> wrote:

>On 9/5/15 2:33 PM, Steve Pope wrote:

>> As I have stated, a given floating point format/spec guarantees that
>> you obtain exact integer arithmetic over a specified range.

>>> even though the floating-point number coming out of the delay line is
>>> exactly the same as what went in, the number it's getting added to is
>>> not the same.

>> Sure it is,

>no, the state of the accumulator will not necessarily be the same when
>x[n] is subtracted as it was when x[n] was added D samples before.

>but i now agree with you that you can do fixed-point equivalents with
>floats and solve this roundoff turd problem.
I feel we are going around in circles, much like a CIC filter.

Maybe a general, "CIC filters won't work if you make an error in
the arithmetic" would suffice.

Steve
Reply by robert bristow-johnson September 6, 2015
On 9/5/15 2:33 PM, Steve Pope wrote:
> robert bristow-johnson<rbj@audioimagination.com> wrote:
>
>> On 9/4/15 2:18 PM, Steve Pope wrote:
>
>>> I think your, and Robert's perspective here would apply to
>>> situations where one is handed a non-standard floating point
>>> processor that is a black box that you don't know what's
>>> in it; such as in a poorly-documented embedded processor.
>>> But that is a (hopefully rare) special-case, and the
>>> behavior of this special case cannot be generalized to
>>> floating point in general.
>
>> no Steve, it's a problem with floating-point CIC in general. glen
>> spelled out the problem.
>
>> you have an accumulator which has a transfer function of
>
>>    H(z) = 1/(1 - z^-1)
>
>> the pole lies directly on the unit circle. it's what we call
>> "critically stable". if the pole was a milli-smidgen inside of the unit
>> circle, then floating point would not be a problem.
>
>> the pole is, of course, canceled by one of the zeros in the comb filter
>> which is
>
>>    G(z) = 1 - z^-N
>
>> where N is the number of samples.
>
>> anyway, to make this work correctly, you have to guarantee that the
>> number that gets added to the accumulator during the present time is
>> exactly subtracted from the accumulator N samples later.
>
> Correct
>
>> with fixed-point arithmetic you can make that guarantee, but with
>> floating point you cannot.
>
> As I have stated, a given floating point format/spec guarantees that
> you obtain exact integer arithmetic over a specified range.
>
>> even though the floating-point number coming out of the delay line is
>> exactly the same as what went in, the number it's getting added to is
>> not the same.
>
> Sure it is,
no, the state of the accumulator will not necessarily be the same when
x[n] is subtracted as it was when x[n] was added D samples before.

but i now agree with you that you can do fixed-point equivalents with
floats and solve this roundoff turd problem.
> if you have not exceeded the range of integers that
> is representable by the floating-point format.
okay, i agree with you on that: if you uniformly quantize the values of
the floats to specific integer multiples of the same quantization unit
(which should be an integer power of 2, and that exponent may be
positive or negative but must be within the floating-point exponent
range), and if those uniformly-quantized floats are bounded in
magnitude to 2^M times that quantization unit, these floats are
equivalent to M+1 bit fixed-point words, but they'll use more bits.

then, if the delay in the CIC does not exceed 2^N samples, that is
D <= 2^N, you're doing a moving sum of no more than 2^N samples and
your sum is no more than 2^(M+N) in magnitude. if you have at least
M+N bits in the mantissa, this moving-sum filter will not have
quantization errors that get stuck in there like the tar-baby.

it's functionally equivalent to rounding the data to fixed-point words
(which really are integers) and implementing the CIC with that.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
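[In code, the scheme robert describes might look something like this --
a minimal sketch; the function name is mine, and it assumes
round-to-nearest via rint():]

    #include <math.h>

    /* quantize x to an integer multiple of the unit 2^E (E may be
       negative), i.e. round(x / 2^E) * 2^E -- after this, every value
       entering the CIC sits on the same uniform grid */
    static double quantize_pow2(double x, int E)
    {
        return ldexp(rint(ldexp(x, -E)), E);
    }

    /* inside the CIC update, with acc and delay[] held in doubles:

           double xq = quantize_pow2(x, E);
           acc += xq - delay[idx];
           delay[idx] = xq;

       if |xq| <= 2^(M+E) and the comb delay D <= 2^N with M+N no more
       than the 53 significand bits of an IEEE double, every add and
       subtract is exact, so what goes into the delay line cancels
       exactly on the way out. */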
Reply by Steve Pope September 5, 2015
glen herrmannsfeldt  <gah@ugcs.caltech.edu> wrote:

[floating point]

>It isn't hard to see what add, subtract, and multiply do.
>
>For divide, many fixed point algorithms depend on truncation,
>and IEEE rounds. (At least it usually does. It might be that
>there is an option to change that.)
Yes, something like floor(51.0 / 24.0) is going to be clearly equal to two, but there are perhaps edge cases where it is not so clear. I have not looked into it.
>In the case of rounding, there is the double rounding problem.
>Rounding to N bits, and then rounding the result to M bits
>isn't always the same as rounding the original to M bits.
Agree
>In the specific IEEE case, rounding to 53 bits and then 32 can
>give a different result from rounding directly to 32.
Well, this should only be true if you've exceeded the integer range of
the double before the first rounding. Otherwise, if the result can fit
in 32 bits, the following should always work:

   double x;
   int n;

   n = (int) rint(x);
>http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/
>https://en.wikipedia.org/wiki/Rounding#Double_rounding
>https://en.wikipedia.org/wiki/Rounding#Table-maker.27s_dilemma
I'll take a look at these. Thanks.

Steve
Reply by Steve Pope September 5, 2015
robert bristow-johnson  <rbj@audioimagination.com> wrote:

>On 9/4/15 2:18 PM, Steve Pope wrote:
>> I think your, and Robert's perspective here would apply to
>> situations where one is handed a non-standard floating point
>> processor that is a black box that you don't know what's
>> in it; such as in a poorly-documented embedded processor.
>> But that is a (hopefully rare) special-case, and the
>> behavior of this special case cannot be generalized to
>> floating point in general.

>no Steve, it's a problem with floating-point CIC in general. glen
>spelled out the problem.
>you have an accumulator which has a transfer function of
> H(z) = 1/(1 - z^-1)
>the pole lies directly on the unit circle. it's what we call
>"critically stable". if the pole was a milli-smidgen inside of the unit
>circle, then floating point would not be a problem.

>the pole is, of course, canceled by one of the zeros in the comb filter
>which is
> G(z) = 1 - z^-N
>where N is the number of samples.
>anyway, to make this work correctly, you have to guarantee that the
>number that gets added to the accumulator during the present time is
>exactly subtracted from the accumulator N samples later.
Correct
> with fixed-point arithmetic you can make that guarantee, but with
> floating point you cannot.
As I have stated, a given floating point format/spec guarantees that you obtain exact integer arithmetic over a specified range.
>even though the floating-point number coming out of the delay line is
>exactly the same as what went in, the number it's getting added to is
>not the same.
Sure it is, if you have not exceeded the range of integers that is representable by the floating-point format.
>as glen pointed out, if the exponent of the accumulator
>value is different, the truncation of the addition will be different.

Not for a broad class of important cases, including this one.

If the difference between the exponents is less than the number of
least-significant zero bits of the significands, there is no round-off.

Steve
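[Both of Steve's claims here are easy to check numerically. A minimal
C sketch; the constants below are picked to sit on either side of each
boundary and are not from the thread:]

    #include <stdio.h>

    int main(void)
    {
        /* "exact integer arithmetic over a specified range": a double
           behaves as an exact integer up to 2^53 */
        double big = 9007199254740992.0;   /* 2^53 */
        printf("%d\n", big - 1.0 == 9007199254740991.0); /* 1: exact */
        printf("%d\n", big + 1.0 == big);  /* 1: past the range, +1 is lost */

        /* the trailing-zero-bits condition, in single precision:
           2^8 keeps enough trailing zero significand bits after
           alignment against 2^30 to add exactly; 2^6 does not */
        float a = 1073741824.0f;           /* 2^30 */
        printf("%d\n", a + 256.0f == a);   /* 0: sum is exact */
        printf("%d\n", a + 64.0f  == a);   /* 1: round-off swallows the 64 */
        return 0;
    }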
Reply by glen herrmannsfeldt September 5, 2015
Steve Pope <spope33@speedymail.org> wrote:
> glen herrmannsfeldt <gah@ugcs.caltech.edu> wrote:
>>OK, if you are doing 16 bit add/subtract using 53 bit significand
>>floating point, and arrange appropriately not to need more than
>>the 53 bits, subtraction should cancel addition.
> Yes.
> I should own up, however, to the fact that I used doubles for, I dunno,
> twenty years or so without trusting them to do integer arithmetic correctly.
> It was only after I looked into it (at the urging of a colleague)
> that I realized it actually works reliably. Prior to that,
(snip)

It isn't hard to see what add, subtract, and multiply do.

For divide, many fixed point algorithms depend on truncation,
and IEEE rounds. (At least it usually does. It might be that
there is an option to change that.)

In the case of rounding, there is the double rounding problem.
Rounding to N bits, and then rounding the result to M bits
isn't always the same as rounding the original to M bits.
As well as I remember it, it is the same if N is more than twice M.

In the specific IEEE case, rounding to 53 bits and then 32 can
give a different result from rounding directly to 32.

http://www.exploringbinary.com/double-rounding-errors-in-floating-point-conversions/

I am not so sure, but it might be that if you divide 16 bit integers
using more than 32 bits rounded, then truncate to 16, it works.
(This isn't exactly the double rounding case, which is why I am
not so sure.)

Also, see:

https://en.wikipedia.org/wiki/Rounding#Double_rounding

and

https://en.wikipedia.org/wiki/Rounding#Table-maker.27s_dilemma

-- glen
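[To make the double-rounding point concrete, here is a minimal C sketch
of a double-to-float conversion; the constants are hand-picked to land
on the rounding boundary and are not from the thread:]

    #include <stdio.h>

    int main(void)
    {
        /* the real value 1 + 2^-24 + 2^-60, correctly rounded straight
           to float (24-bit significand), is 1 + 2^-23.  rounding it to
           double first lands exactly on the float halfway point, and
           the second rounding (ties-to-even) falls back to 1.0 */
        double d = 1.0 + 0x1p-24 + 0x1p-60; /* rounds to 1 + 2^-24 in double */
        float  f = (float)d;                /* double-rounded result: 1.0f */

        printf("double-rounded: %.8f   single-rounded: %.8f\n",
               f, 1.0f + 0x1p-23f);
        return 0;
    }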
Reply by Les Cargill September 4, 2015
glen herrmannsfeldt wrote:
> Randy Yates <yates@digitalsignallabs.com> wrote:
>> spope33@speedymail.org (Steve Pope) writes:
>>> robert bristow-johnson <rbj@audioimagination.com> wrote:
>
>>>> in the CIC moving average (which *does* have a finite impulse response,
>>>> but the implementation uses an IIR filter with a pole living directly
>>>> on the unit circle),
>
>>>> when a float is added or subtracted from the accumulator,
>>>> there is some kinda roundoff error happening.
>
> (snip)
>>> Sure you do. Floating point operations are completely deterministic,
>
>> I don't think Robert meant "non-deterministic" but rather
>> "non-invertible," in the sense that negative numbers in a group are
>> additive inverses and thus are guaranteed to create the identity element
>> when added.
>
> (snip)
>>> Matlab stores integer data in doubles, there is no round-off,
>>> and CIC filters work correctly and will exactly bit-match a fixed-point
>>> implementation.
>
> OK, if you are doing 16 bit add/subtract using 53 bit significand
> floating point, and arrange appropriately not to need more than
> the 53 bits, subtraction should cancel addition.
>
>> "Stores in doubles" and "operates using double arithmetic" are two
>> different things. Does Matlab "operate using double arithmetic?" I doubt
>> it, if, as you claim, it runs integer CIC filters correctly.
>
> But on the other hand, using 53 bits (not even counting the
> exponent) when you don't need them, is a waste.
>
>> The reason is that the integer version of a CIC filter depends on the
>> property of two's complement integer arithmetic overflow, namely, that
>> it "unoverflows" if the appropriate reverse sum(s) is made. Rick Lyons
>> pointed this out in this group several years ago. Doubles have no such
>> property as far as I know.
>
> If you can avoid overflow, then you don't need the property.
> With a 24 bit significand, there is a good chance of overflowing
> the 24 bits. Not so easy with 53 bits.
>
> But then why not use 32 bit or 64 bit fixed point?
I suppose the answer is that 64 bit floating point is *generally* a first-class type in the language system. Being able to avail yourself of <math.h> is no small benefit. But then again, so is "long long", so perhaps this argument is spurious.
> -- glen
-- Les Cargill
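[As an aside, the "unoverflows" property Randy mentions in the quoted
exchange is easy to demonstrate. A minimal C sketch using unsigned
(wraparound, mod 2^16) arithmetic; delay length and test values are
arbitrary choices of the editor:]

    #include <stdio.h>
    #include <stdint.h>

    #define D 8   /* comb delay, arbitrary for the demo */

    int main(void)
    {
        uint16_t integ = 0, delay[D] = {0}, x[200], y = 0;
        unsigned direct = 0;
        int idx = 0, n;

        for (n = 0; n < 200; n++)
            x[n] = (uint16_t)(37 * n);   /* arbitrary test input */

        for (n = 0; n < 200; n++) {
            integ += x[n];           /* integrator wraps mod 2^16, often */
            y = integ - delay[idx];  /* comb "unoverflows" the wraps     */
            delay[idx] = integ;
            idx = (idx + 1) % D;
        }

        for (n = 200 - D; n < 200; n++)  /* direct moving sum, to compare */
            direct += x[n];

        printf("CIC output %u, direct sum %u\n", y, direct); /* they match */
        return 0;
    }

[The outputs match because the comb's subtraction is exact mod 2^16: the
result is correct whenever the true D-sample sum itself fits in 16 bits,
no matter how many times the integrator wrapped in between.]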
Reply by rickman September 4, 2015
On 9/4/2015 7:08 PM, robert bristow-johnson wrote:
> On 9/4/15 1:15 PM, rickman wrote:
>> On 9/4/2015 1:14 PM, Randy Yates wrote:
>>> rickman <gnuarm@gmail.com> writes:
>>>
> ...
>>>> So why can't you understand that the use of floating point vs. integer
>>>> in this situation makes no difference? Instead you start talking
>>>> about defining the universe and making silly emotional statements
>>>> about the rocket blowing up...
>>>
>>> Rick,
>>>
>>> Go to a movie. Spend time with your mother. There is more to life than
>>> usenet.
>>
>> Obviously you have lost the thread here. We were talking about the
>> causes of the Ariane 5 rocket crash.
>
> so "rocket blowing up" is a silly emotional statement and "rocket crash"
> is something else?
The silly emotional part was to leave behind all rational discussion
and say the part you snipped. "Come on, Rick. SOMEBODY didn't think
about SOMETHING! The damn rocket blew up." So if someone didn't think
of something, then floating point must be the problem??? I can't see
any rational value to your comment either.

Is there something wrong with the term "rocket crash"? What term
should I have used to describe a rocket blowing up and crashing?

--

Rick
Reply by robert bristow-johnson September 4, 2015
On 9/4/15 2:18 PM, Steve Pope wrote:
>
> I think your, and Robert's perspective here would apply to
> situations where one is handed a non-standard floating point
> processor that is a black box that you don't know what's
> in it; such as in a poorly-documented embedded processor.
> But that is a (hopefully rare) special-case, and the
> behavior of this special case cannot be generalized to
> floating point in general.
no Steve, it's a problem with floating-point CIC in general. glen
spelled out the problem.

you have an accumulator which has a transfer function of

   H(z) = 1/(1 - z^-1)

the pole lies directly on the unit circle. it's what we call
"critically stable". if the pole was a milli-smidgen inside of the
unit circle, then floating point would not be a problem.

the pole is, of course, canceled by one of the zeros in the comb
filter which is

   G(z) = 1 - z^-N

where N is the number of samples.

anyway, to make this work correctly, you have to guarantee that the
number that gets added to the accumulator during the present time is
exactly subtracted from the accumulator N samples later. with
fixed-point arithmetic you can make that guarantee, but with floating
point you cannot.

if the exact cancellation does not happen, your accumulator is left
with a tiny turd that will never be removed. other turds will be added
later and you will never remove them. that's the problem.

even though the floating-point number coming out of the delay line is
exactly the same as what went in, the number it's getting added to is
not the same. as glen pointed out, if the exponent of the accumulator
value is different, the truncation of the addition will be different.
then what truly is subtracted might not be exactly the same as what
was added N samples ago.

--

r b-j                  rbj@audioimagination.com

"Imagination is more important than knowledge."
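[To see the stuck-residue failure concretely, here is a minimal
single-precision sketch; the delay length and input values are
arbitrary, chosen only to force round-off, and are not from the
thread:]

    #include <stdio.h>

    #define D 4   /* comb delay, arbitrary for the demo */

    int main(void)
    {
        float delay[D] = {0}, acc = 0.0f;
        int idx = 0, n;

        for (n = 0; n < 1000; n++) {
            /* a burst of values with fractional parts, then silence */
            float x = (n < 100) ? 1.0e6f + 0.1f * (float)(n % 7) : 0.0f;
            acc += x - delay[idx];  /* integrator + comb in one step */
            delay[idx] = x;         /* the same x comes back D samples later,
                                       but acc is rounded differently by then */
            idx = (idx + 1) % D;
        }

        /* with exact arithmetic acc would be 0 once the input has been
           zero for D samples; in float a residual is typically left */
        printf("residual stuck in the accumulator: %g\n", acc);
        return 0;
    }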
Reply by Steve Pope September 4, 2015
glen herrmannsfeldt  <gah@ugcs.caltech.edu> wrote:

>OK, if you are doing 16 bit add/subtract using 53 bit significand
>floating point, and arrange appropriately not to need more than
>the 53 bits, subtraction should cancel addition.
Yes.

I should own up, however, to the fact that I used doubles for, I dunno,
twenty years or so without trusting them to do integer arithmetic
correctly. It was only after I looked into it (at the urging of a
colleague) that I realized it actually works reliably. Prior to that,
>> "Stores in doubles" and "operates using double arithmetic" are two >> different things. Does Matlab "operate using double arithmetic?" I doubt >> it, if, as you claim, it runs integer CIC filters correctly.
>But on the other hand, using 53 bits (not even counting the
>exponent) when you don't need them, is a waste.
It's a waste in some circumstances.
>But then why not use 32 bit or 64 bit fixed point?
You'll have to ask the designers of languages like Matlab and Perl and,
I assume, a raft of the newer languages why they don't have native
integer types. I think the 90% answer is they are software engineers
who are not conditioned to care about efficiency... given all the code
bloat, using doubles to store ints is the least of their problems.

Steve