Extended Precision Floating vs Fixed Point - Sharc

Started by Al Clark December 16, 2003
There have been several papers comparing filter topologies and filter 
noise performance. There have also been several papers comparing fixed 
point to floating point processing.

Has anyone examined Extended Floating Point (32 bit mantissa) versus 32 
bit fixed point? Both options are available with a Sharc.

FIR case: 

Fixed point should work very well since you have an 80 bit accumulator 
that is only rounded once after all the MACs. I'm not sure what the 
floating point tradeoff might be since you might have multiplying 
advantages with small coefficients but results are always summed into a 
floating point result, which will effectively reduce the contribution 
of small numbers. 
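
The concern above, that small terms get absorbed when every partial sum is rounded back to a floating-point value, can be sketched in a few lines of Python. This is only an illustration, not SHARC code: `round_sig`, `sum_round_each` and `sum_round_once` are hypothetical helpers, and the 32 significant bits are an assumed stand-in for the extended-precision mantissa.

```python
import math

def round_sig(x, bits):
    """Round x to `bits` significant binary digits (round to nearest)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def sum_round_each(terms, bits):
    """Floating-point style: round the running sum after every add."""
    acc = 0.0
    for t in terms:
        acc = round_sig(acc + t, bits)
    return acc

def sum_round_once(terms, bits):
    """Wide-accumulator style: sum exactly, round a single time at the end."""
    return round_sig(math.fsum(terms), bits)

# One large term followed by many terms far below its rounding step.
terms = [1.0] + [2.0 ** -40] * 1024
each = sum_round_each(terms, 32)   # the small terms never register
once = sum_round_once(terms, 32)   # together they add up to 2**-30
```

Here every small term is below half a rounding step of the running sum, so the round-each-add path stays at 1.0 while the round-once path keeps the accumulated 2**-30. Whether that matters in a real FIR depends on the coefficient and signal statistics.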

IIR case:

One acid test is a high Q low frequency bandpass. With a fixed point 
implementation I would use Direct Form I (perhaps with noise shaping).
What are the floating point tradeoffs?


-- 
Al Clark
Danville Signal Processing, Inc.
--------------------------------------------------------------------
Purveyors of Fine DSP Hardware and other Cool Stuff
Available at http://www.danvillesignal.com
Hi, Al.  This is one of my favorite topics, actually!

First of all, keep in mind that the SHARC's extended precision
floating-point actually achieves a 33-bit mantissa because of the "hidden
bit" in the IEEE floating point format.  A small point, but one at least
worth mentioning.

In my application, I ended up using the floating-point mode.  But this was
primarily because the whole system needed to be floating point, not because
the filters are better or worse.  With that said, see specific comments
below.

"Al Clark" <dsp@danvillesignal.com> wrote in message
news:Xns94537DF0991Daclarkdanvillesignal@66.133.130.30...
> There have been several papers comparing filter topologies and filter
> noise performance. There have also been several papers comparing fixed
> point to floating point processing.
>
> Has anyone examined Extended Floating Point (32 bit mantissa) versus 32
> bit fixed point? Both options are available with a Sharc.
>
> FIR case:
>
> Fixed point should work very well since you have an 80 bit accumulator
> that is only rounded once after all the MACs. I'm not sure what the
> floating point tradeoff might be since you might have multiplying
> advantages with small coefficients but results are always summed into a
> floating point result, which will effectively reduce the contribution
> of small numbers.

What is the precision of your source data?  If your original source data
is 24 bits or less, I would think the 40-bit floating point would be
plenty adequate for an FIR.  Granted, you may be doing some rounding after
every MAC, but you still have plenty of extra bits (e.g. worst case at
least 9 if your source data is 24-bit), so I don't think there would be
any significant loss.  I don't think it's really necessary to keep _all_
the extra bits, just enough to ensure accuracy of the final rounded
result.

The 80-bit accumulator with fixed point works excellently as well (I used
that in another job).  There is no loss of precision in the multiplies and
you have plenty of guard bits for overflow.

I think both methods would achieve essentially equivalent results, so the
decision may hinge more on other factors such as:

1) With floating-point, you don't get a true MAC, just a parallel
   multiply/add which can be used to make a "pipelined MAC" with some
   (usually small) overhead.

2) With the accumulator, there are usually some required instructions to
   set up and get the data out in a usable form.

3) There are only 2 80-bit fixed-point accumulators, but 16 40-bit
   floating-point "accumulators" (32 if you count the background
   registers).

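A rough way to check the "essentially equivalent results" argument for the FIR case is to run both accumulation styles on 24-bit data in exact rational arithmetic. The sketch below is plain Python, not SHARC code. The function names, the 64-tap Hann-shaped coefficients, the test signal, and the choice of 32 significant bits for the floating-point path are all assumptions made for illustration.

```python
import math
from fractions import Fraction

def round_sig(x, bits):
    """Round a Fraction to `bits` significant binary digits."""
    if x == 0:
        return Fraction(0)
    e = abs(x.numerator).bit_length() - x.denominator.bit_length()
    if abs(x) < Fraction(2) ** e:      # normalise so 2**e <= |x| < 2**(e+1)
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)
    return round(x / ulp) * ulp

def round_q31(x):
    """Round a Fraction to a Q1.31 fixed-point grid (step 2**-31)."""
    return Fraction(round(x * 2 ** 31), 2 ** 31)

def fir_float_style(coefs, samples, bits=32):
    """Round the product and the running sum after every multiply/add."""
    acc = Fraction(0)
    for c, x in zip(coefs, samples):
        acc = round_sig(acc + round_sig(c * x, bits), bits)
    return acc

def fir_fixed_style(coefs, samples):
    """Accumulate exactly (wide accumulator), round once at the end."""
    return round_q31(sum(c * x for c, x in zip(coefs, samples)))

def compare_fir(n_taps=64):
    # Assumed test data: Hann-shaped taps on a Q1.31 grid with sum <= 1,
    # and a 24-bit sine-like input scaled to 0.9 of full scale.
    raw = [0.5 - 0.5 * math.cos(2 * math.pi * (k + 0.5) / n_taps)
           for k in range(n_taps)]
    total = sum(raw)
    coefs = [Fraction(int(r / total * 2 ** 31), 2 ** 31) for r in raw]
    samples = [Fraction(round(0.9 * math.sin(0.37 * k) * 2 ** 23), 2 ** 23)
               for k in range(n_taps)]
    return fir_float_style(coefs, samples), fir_fixed_style(coefs, samples)
```

With partial sums below 1.0, each rounding in the floating-point path is off by at most 2**-33, so 64 taps with two roundings each stay within 2**-26 of the exact sum, comfortably below one 24-bit LSB. That matches the point above that you only need enough extra bits to protect the final rounded result.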
> IIR case:
>
> One acid test is a high Q low frequency bandpass. With a fixed point
> implementation I would use Direct Form I (perhaps with noise shaping).
> What are the floating point tradeoffs?

In my experience, the filter form is at least as important as, if not more
important than, the issue of floating-point vs. fixed-point and precision.
I've had excellent results in floating-point with the 4-multiply
normalized Lattice/Ladder form, though there is obviously some cost in
execution and extra coefficient storage.  The Direct Form II Transposed is
the best of the Direct Forms in my experience.

One key with floating-point is to store the delay elements/state variables
(the "Z's") as 40-bit data (using 48-bit wide memory).  Without this, you
are losing much of the benefit of the extended precision.  In your acid
test case, for example, the delay elements contribute strongly to the
result.  This may cost you in terms of memory usage, execution time,
and/or programming hassle factor.  In my particular application, it turned
out to work better to use the Lattice/Ladder form, where I could get away
with 32-bit delay element storage, rather than deal with 40-bit storage.
YMMV.

Though I haven't tried 32-bit fixed-point IIRs on the SHARC, my hunch is
that there would be no advantage over "floating-point done right," i.e.
40-bit math _and_ storage.  The 80-bit accumulator provides little or no
benefit with an IIR biquad, and the floating-point is always going to have
more resolution than the fixed (33 vs. 32 with signals close to full
scale).

IMHO, the extended precision mode on the SHARC is about as ideal a format
to work with as there is in today's DSPs.  You have all the flexibility
and ease of programming of floating-point (no scaling/overflow to deal
with!), a very wide mantissa, and plenty of "accumulators".  I only wish
there was a true floating-point MAC, though I certainly understand that
this instruction would probably be the slowest path through the hardware
and consequently the limiting factor in clock speed.

-Jon
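
The point about delay-element storage can be tried out with a small simulation of a Direct Form II Transposed biquad. This is a sketch in Python under stated assumptions, not SHARC code: the band-pass coefficients follow the widely used audio-EQ cookbook formulas, and the 20 Hz / Q = 20 / 48 kHz test case, the function names, and the 32- vs. 24-bit significand widths (standing in for 40-bit vs. 32-bit storage) are all illustrative choices.

```python
import math

def round_sig(x, bits):
    """Round x to `bits` significant binary digits (round to nearest)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** bits
    return math.ldexp(round(m * scale) / scale, e)

def bandpass_coefs(f0, q, fs):
    """Band-pass biquad (0 dB peak gain), normalised so a0 = 1."""
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    return (alpha / a0, 0.0, -alpha / a0,
            -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)

def df2t(x, coefs, calc_bits, state_bits):
    """Direct Form II Transposed; results rounded to calc_bits,
    delay elements additionally rounded to state_bits when stored."""
    b0, b1, b2, a1, a2 = coefs
    s1 = s2 = 0.0
    out = []
    for xn in x:
        y = round_sig(b0 * xn + s1, calc_bits)
        s1 = round_sig(round_sig(b1 * xn - a1 * y + s2, calc_bits), state_bits)
        s2 = round_sig(round_sig(b2 * xn - a2 * y, calc_bits), state_bits)
        out.append(y)
    return out

def state_width_errors(f0=20.0, q=20.0, fs=48000.0, n=20000):
    """Peak output error vs. a double-precision reference for wide
    and narrow delay-element storage."""
    coefs = bandpass_coefs(f0, q, fs)
    w0 = 2.0 * math.pi * f0 / fs
    x = [0.5 * math.sin(w0 * k) for k in range(n)]
    ref = df2t(x, coefs, 53, 53)       # 53 bits: no extra rounding
    wide = df2t(x, coefs, 32, 32)      # wide math and wide storage
    narrow = df2t(x, coefs, 32, 24)    # wide math, narrow storage
    peak = lambda y: max(abs(a - b) for a, b in zip(y, ref))
    return peak(wide), peak(narrow)
```

Calling `state_width_errors()` returns the two peak errors. In this setup the narrow-storage error should come out well above the wide-storage one, because rounding noise injected at the delay elements is amplified heavily by poles this close to the unit circle.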