FFT rounding errors in fixed vs floating point
Started by ●January 30, 2011

The topic of "FFT rounding errors" came up on a web forum. The original poster there was working with 18-bit data and was hoping to select between floating point and fixed point DSP chips.

My suggestion in that forum was that, with 18-bit real data, an FFT on a 24-bit fixed point DSP would not pose any rounding errors. But the original poster was convinced that floating point would be necessary to avoid rounding errors. Obviously, there's no need to rule out the entire set of fixed-point DSP chips unless there is a specific reason.

My impression is that the FFT (and DFT) provide equivalent bit depth precision on input and output. Thus, a 24-bit fixed point FFT should pose no rounding errors on 18-bit data, and would, in fact, have more bits than are significant. Or, to put it another way, if you have 18 bits of precision on input, then you have 18 bits of precision on output, unless you're working on a 16-bit DSP and do not take measures to deal with overflow.

I understand that IIR and FIR filters often need a very high bit depth for the filter coefficients, otherwise the actual frequency response differs from the desired frequency response (unless careful compensations are made). Thus, floating point is often highly beneficial for filters. But I do not have the impression that floating point is necessarily going to show the same benefits for FFT, or any benefit at all, actually.

Can anyone comment to confirm my suspicions, or to explain what I am missing?

Brian Willoughby
Sound Consulting
Reply by ●January 30, 2011
On 01/30/2011 04:02 PM, rsdio wrote:
> The topic of "FFT rounding errors" came up on a web forum. The original
> poster there was working with 18-bit data and was hoping to select between
> floating point and fixed point DSP chips.
> (snip)
> Can anyone comment to confirm my suspicions, or to explain what I am
> missing?

You're making two cognitive errors.

Your first cognitive error is that the FFT does not necessarily provide equivalent bit depth precision on input and output. I suppose that on average it does, but not for each specific bin. You understand the demands of a FIR filter on bit depth: the FFT has the same demands.
In fact, the FFT is nothing but an efficient algorithm to implement the DFT, and the DFT acts like a bank of N FIR filters, each one N points long, getting run once on data that's N points long.

Consider this: assume that you have an ADC that has, by some magic, transition points that are infinitely precise. It still has a finite number of bits, however. This ADC is measuring a signal that is composed of a perfectly constant DC signal, added to noise that is (again magically) band limited in such a manner that all of the noise power appears in the range 0.3 * Fs to 0.4 * Fs when you take the FFT. Now if you take a 65536 point FFT of this ADC output, the precision you would need to capture the DC value of the signal would be 16 bits higher than the width of the ADC.

Your second cognitive error (which you may be thinking of in the back of your mind, but maybe not) is that floating point necessarily provides more bit depth than fixed point. IEEE 32-bit floating point only budgets 24 bits to the mantissa, with an additional virtual bit of precision provided by normalization. IEEE 64-bit floating point budgets something like 48 bits to the mantissa (I can't remember the exact number), and so only has one more bit of effective precision after normalization.

So _if_ the numbers are scaled correctly, a 24-bit fixed-point filter (I include an FFT here as an example of a 'filter') is going to have almost the same precision as a 32-bit floating point filter, and a 32-bit fixed-point filter will have _better_ performance. You could go to 64-bit floating point numbers and get better performance -- but it's a rare DSP chip that doesn't support fast double-width computations, so you could do the same thing with your fixed-precision DSP.
None of this is intended to minimize the ease of doing DSP in a floating-point environment: it's often way, way easier, and the money, power, and system size spent on accommodating a floating point processor is often more than paid for in the ease of engineering the system. But it often isn't, too.

--
Tim Wescott
Wescott Design Services
http://www.wescottdesign.com

Do you need to implement control loops in software?
"Applied Control Theory for Embedded Systems" was written for you.
See details at http://www.wescottdesign.com/actfes/actfes.html
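Tim's DC-plus-noise thought experiment can be checked numerically. The sketch below uses a hypothetical 12-bit "ADC" (the bit width and seed are illustrative assumptions, not from the thread): the DC bin of a DFT is just the sum of the block, and summing n samples of b-bit data can require up to b + log2(n) bits to hold exactly, which is where the "16 bits higher" figure for a 65536-point FFT comes from.

```python
import numpy as np

# Hypothetical parameters: a 12-bit ADC and a 65536-point FFT.
adc_bits = 12
n = 65536

rng = np.random.default_rng(0)
noise = rng.integers(-2**(adc_bits - 1), 2**(adc_bits - 1), size=n)
x = 100 + noise                    # constant DC plus wideband noise

# The DC bin of the DFT is just the plain sum of the block.
dc_bin = int(np.sum(x))
assert dc_bin == int(np.fft.fft(x)[0].real.round())

# Worst-case growth: holding the sum exactly needs b + log2(n) bits.
print(int(np.log2(n)))             # 16 extra bits for a 65536-point FFT
```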
Reply by ●January 30, 2011
Tim Wescott <tim@seemywebsite.com> wrote:
(snip)
> Your second cognitive error (which you may be thinking of in the back of
> your mind, but maybe not) is that floating point necessarily provides
> more bit depth than fixed point. IEEE 32-bit floating point only
> budgets 24 bits to the mantissa, with an additional virtual bit of
> precision provided by normalization. IEEE 64-bit floating point budgets
> something like 48 bits to the mantissa (I can't remember the exact
> number), and so only has one more bit of effective precision after
> normalization.

IEEE single has a 23 bit significand, plus the 'hidden 1' for 24. IEEE double has 52 bits plus the 'hidden 1', for 53.

Is the 'hidden 1' what you are calling 'provided by normalization'?

-- glen
Reply by ●January 30, 2011
My thought is that it will depend on the number of points used for the FFT. So the best way is to measure the SNR after the FFT is done and see whether it can resolve 24-bit input numbers. Otherwise the number of fractional bits needs to be increased.

Regards
Bharat
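The measurement Bharat suggests can be sketched in a few lines of NumPy. This is a minimal illustration, not a DSP-chip simulation: the FFT size, bin-centered tone, and 18-bit quantization below are assumed test parameters, and the FFT itself runs in double precision, so the SNR reflects the input quantization only.

```python
import numpy as np

# Quantize a test tone, FFT it, and compare against the FFT of the
# unquantized tone to get an SNR figure for the quantization step.
n = 4096
bits = 18
t = np.arange(n)
x = 0.9 * np.sin(2 * np.pi * 371 * t / n)   # tone centered on bin 371

step = 2.0**-(bits - 1)                      # 18-bit quantization step
xq = np.round(x / step) * step

X, Xq = np.fft.fft(x), np.fft.fft(xq)
err = Xq - X
snr_db = 10 * np.log10(np.sum(np.abs(X)**2) / np.sum(np.abs(err)**2))
# Expect roughly 6 dB per bit for a near-full-scale tone.
```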
Reply by ●January 30, 2011
On Jan 30, 7:32 pm, glen herrmannsfeldt <g...@ugcs.caltech.edu> wrote:
> Tim Wescott <t...@seemywebsite.com> wrote:
> (snip)
> > Your second cognitive error (which you may be thinking of in the back of
> > your mind, but maybe not) is that floating point necessarily provides
> > more bit depth than fixed point. ...
>
> IEEE single has a 23 bit significand, plus the 'hidden 1' for 24.
> IEEE double has 52 bits plus the 'hidden 1', for 53.

and remember the sign bit, which, although to the left of the exponent, is something that i would attach to the mantissa.

IEEE single can represent an arbitrary 25-bit signed integer exactly.

r b-j
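The 25-bit claim is easy to verify with NumPy's float32 (a stand-in here for IEEE single): 23 stored significand bits plus the hidden 1 plus the sign bit cover every integer of magnitude up to 2**24 exactly, which includes the whole 25-bit signed range.

```python
import numpy as np

# Every integer with |v| <= 2**24 round-trips through float32 exactly.
assert int(np.float32(2**24)) == 2**24             # still exact
assert int(np.float32(-(2**24))) == -(2**24)       # sign bit covers this
assert int(np.float32(2**24 + 1)) != 2**24 + 1     # first integer that rounds

# Spot check a slice near the top of the range.
for v in range(2**24 - 1000, 2**24 + 1):
    assert int(np.float32(v)) == v
```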
Reply by ●January 30, 2011
On Jan 30, 9:38 pm, "bharat pathak" <bharat@n_o_s_p_a_m.arithos.com> wrote:
> My thought process is it will depend on number of points used for FFT.
> So best way is to measure the SNR after the FFT is done and see it is
> able to measure 24 bit input numbers. Else fractional bits needs to be
> increased.

there are closed form expressions for the additive error for both fixed and float. i think they are in Oppenheim and Schafer, but i don't have my copy handy at the moment

r b-j
Reply by ●January 30, 2011
On Jan 30, 7:02 pm, "rsdio" <Sound_Consulting@n_o_s_p_a_m.Sounds.wa.com> wrote:
...
> Can anyone comment to confirm my suspicions, or to explain what I am
> missing?

i think you might be missing what happens to the word length after multiplication with the FFT twiddle factor.

also, i think you have a misconception of the relative value of floating-point numbers in filters and the FFT. i would say it is much more common to need floating point in FFTs than it is in IIR or FIR filters.

r b-j
Reply by ●January 31, 2011
> i think you might be missing what happens to the word length after
> multiplication with the FFT twiddle factor.
>
> r b-j

Good point. The OP's 18-bit input would probably be multiplied by a 24-bit twiddle factor on a fixed-point DSP, and that would produce up to 42 bits (although not necessarily 42 significant bits). Although the typical 24-bit fixed-point DSP might have 48-bit or 56-bit registers, the FFT routine likely truncates those when storing the results.

I guess this explains why there is a 32-bit FFT on my 16-bit DSP.
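The 42-bit figure follows from the general rule that a signed a-bit by signed b-bit multiply can need a + b bits in the worst case. A quick check in plain Python, using the 18-bit data and 24-bit twiddle widths from the post above:

```python
# Worst-case product width for signed fixed-point multiplication:
# most-negative times most-negative gives the largest positive product.
a_bits, b_bits = 18, 24
x = -(2**(a_bits - 1))            # most negative 18-bit sample
w = -(2**(b_bits - 1))            # most negative 24-bit twiddle factor
p = x * w                         # = +2**40, the worst-case product
print(p.bit_length() + 1)         # 42 bits including the sign bit
```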
Reply by ●January 31, 2011
On Jan 30, 9:40 pm, robert bristow-johnson wrote:
> and remember the sign bit, which, although to the left of the
> exponent, is something that i would attach to the mantissa.
>
> IEEE single can represent an arbitrary 25-bit signed integer exactly.
>
> r b-j

Thank you for spelling that out. I had 25 in the back of my mind for years, but recently heard many people claiming 24, and I could not remember how I got to 25. It's 23 explicit mantissa bits, 1 hidden, 1 explicit sign bit.

Brian
Reply by ●February 2, 2011
In article <teWdnSOT5I7yYtjQnZ2dnUVZ_gednZ2d@web-ster.com>, Tim Wescott <tim@seemywebsite.com> wrote:
> Your first cognitive error is that the FFT does not necessarily provide
> equivalent bit depth precision on input and output. I suppose that on
> average it does, but not for each specific bin. You understand the
> demands of a FIR filter on bit depth: the FFT has the same demands. In
> fact, the FFT is nothing but an efficient algorithm to implement the
> DFT, and the DFT acts like a bank of N FIR filters, each one N points
> long, getting run once on data that's N points long.
> ...
> Now if you take a 65536 point FFT of this ADC output, the precision you
> would need to capture the DC value of the signal would be 16 bits higher
> than the width of the ADC.

I almost follow this, but I am not convinced that an N-point FFT is exactly the same as N FIR filters - at least not unless you consider that the equivalent FIR coefficients have only 1 bit of precision. I'm going way out on a limb here, since I do not know the FFT inside and out just yet, but it seems very important that the frequencies are all related by powers of 2. In other words, I don't see why you'd need more than N bits of precision for an N-bit FFT.

My understanding of the problem with FIR coefficients is that finite precision will move your frequencies by quantization, and thus change the shape of your response.
As Zölzer shows in his Digital Audio Signal Processing text, there are discrete positions available for the poles, and you have several options to juggle the sequence of mathematical functions such that you can end up with a different pattern of discrete positions. My inclination is that the discrete positions available in the FFT are precisely the bins that you get - i.e. they're not shifted due to finite precision. However, it does seem that performing an N-bin FFT on an M-bit processor with N > M would shift the frequencies.

Someone mentioned the twiddle table in the FFT, but the data in the table is just cos and sin. They're not the same as coefficients, are they? I have the impression that the filter coefficients in the FFT are inherent, not explicit. Then again, I could be totally confusing myself by conflating this with a description of the DFT by Hal Chamberlin, which describes multiplying the input signal by the center frequency of each bin, which produces the sum and difference frequencies, and summing/integrating the resulting waves over the window to find the DC component, which becomes important when the difference frequency is close to 0. Whether this has any relation to FIR coefficients is beyond me.

One thing that does start to make sense in this thread is that multiplication will expand the bit depth, depending upon the precision of the inputs. How that is altered when using FFT implementations that use only add, and not multiply, is also beyond me at the moment. Seems like an addition-only FFT implementation would not expand the bit depth very far beyond the input values - at least not as much as multiplication.

Sorry if I'm just confusing the topic with more questions. It would be great if someone could provide references - particularly those that have been hinted at so far. I was happy to add the Zölzer and Smith (DSP Guide) books to my collection, and I'm sure it wouldn't hurt to add more.
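On the "bank of FIR filters" question raised above, a quick numerical check shows the coefficients are not 1-bit: one DFT bin is exactly an N-tap FIR with complex-exponential (cos/sin twiddle) coefficients, evaluated once per N-point block. The bin index and block length below are arbitrary choices for illustration.

```python
import numpy as np

# One DFT bin computed directly as an N-tap "FIR" dot product, then
# compared against the same bin from a full FFT.
n, k = 256, 7                          # block length and an arbitrary bin
x = np.random.default_rng(1).standard_normal(n)

coeffs = np.exp(-2j * np.pi * k * np.arange(n) / n)   # the bin's FIR taps
bin_direct = np.sum(x * coeffs)        # FIR-style dot product
bin_fft = np.fft.fft(x)[k]
assert np.isclose(bin_direct, bin_fft)
```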
Brian Willoughby Sound Consulting