Forums

FFT scaling for hardware implementation

Started by Syswip September 30, 2011
Hi forum,

Do we really need to scale FFT for hardware implementation?

Up to now I have found 2 situations where scaling has an advantage:

1. It saves logic because the internal word length remains
unchanged.
2. The twiddle-factor multiplication error will be smaller.

Scaling can also be required if the output word length must equal the
input word length. In this case I have 2 choices:

1. Do not scale (increase word length after each stage) and at the end
shift the result right by the corresponding value.
2. Use scaling.

How do I make the choice if I don't care about gate counts?
Does the scaling significantly decrease the FFT error?

Thank you very much,
Tiksan.

On Sep 30, 10:03 am, "Syswip" <syswip@n_o_s_p_a_m.gmail.com> wrote:
> Hi forum,
>
> Do we really need to scale FFT for hardware implementation?
> [snip]
Well, if you have a nearly full-scale sinusoid centered on one of your bin
frequencies, the FFT's output will be (N/2) times your full scale in
amplitude, where N is the number of samples. Is this a problem for you?

More than likely the signals you encounter will not be such a pure
sinusoid, but if your signal is not spectrally flat, then you will likely
need scaling somewhere in the process so as to not saturate or underflow.
You should be able to find many articles on FFT scaling.

Clay
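Clay's (N/2) figure is easy to check numerically. A minimal NumPy sketch
(the transform length and bin index here are illustrative, not from the
thread):

```python
import numpy as np

N = 1024                    # transform length
k = 37                      # any bin index
n = np.arange(N)

# Full-scale (amplitude 1.0) sinusoid centered exactly on bin k
x = np.cos(2 * np.pi * k * n / N)

X = np.fft.fft(x)           # unscaled (unnormalized) FFT
peak = np.abs(X).max()
print(peak)                 # N/2 = 512.0 for this signal
```

So an unscaled fixed-point pipeline needs roughly log2(N) extra bits of
headroom just to hold this one output sample without saturating.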
Syswip <syswip@n_o_s_p_a_m.gmail.com> wrote:

>Do we really need to scale FFT for hardware implementation?
>[snip]
>1. Do not scale (increase word length after each stage) and at the end
>shift the result right by the corresponding value.
>How to make the choice if I don't care about gate counts?
If you don't care about gate counts, or clock speed, then it doesn't
matter. However, you might think about just how large some of these word
widths get before concluding you don't care about gate counts. How many
gates is a 384 x 16 bit multiplier? How slow is it?

Steve
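Steve's point about word widths can be made concrete with a
back-of-the-envelope calculation. A rough sketch, assuming a radix-2 FFT
where the magnitude can at most double per stage (so about one bit of
growth per stage, plus a guard bit for the real/imaginary parts); the
16-bit input width is an assumption for illustration:

```python
import math

def unscaled_width(input_bits, n_points, guard=1):
    """Worst-case word width for an unscaled radix-2 FFT datapath:
    roughly one bit of growth per stage, plus guard bits because the
    real/imaginary parts can briefly exceed the magnitude bound."""
    stages = int(math.log2(n_points))
    return input_bits + stages + guard

for n in (64, 1024, 4096, 1 << 20):
    print(n, unscaled_width(16, n))
```

For a 1M-point transform the final stage is doing 36-bit-plus arithmetic,
which is why "don't scale, shift at the end" gets expensive fast.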
>If you don't care about gate counts, or clock speed, then it doesn't
>matter. [snip]
>
>Steve
Thanks Steve,

What about error propagation? Does scaling decrease FFT rounding-error
propagation?

Best,
Tiksan.
Syswip <syswip@n_o_s_p_a_m.gmail.com> wrote:
>Do we really need to scale FFT for hardware implementation? >[snip]
I'm not sure you mean 'scaling', as opposed to 'rounding or truncating',
'avoiding overflow/underflow', or something else.

Usually, input and output word sizes are the same, but I've seen (and
worked on) designs that allow for bit growth to preserve dynamic range and
avoid overflow. And the latter is critical. It's also difficult, as noted
in a letter to the editor: "FFT algorithm contains overflow hazard," EDN
Magazine, Feb. 23, 1984, pp. 29-34. It might be tough to get a copy of it,
but it would very quickly dissuade you from thinking that it's a trivial
thing to do.

You should be aware that in a radix-2 DIT FFT, bit growth per stage can be
2.4142..., and in a DIF stage it can be 2.8284.... So even though the
MAGNITUDE of the result can't grow by more than a factor of 2 per stage,
the real/imaginary parts most certainly can exceed a factor of 2 in a
single pass (work through a few examples using +/-0.707... twiddles and
you'll see it). This kind of bit growth can't occur at every stage, but
its potential must be recognized.

A simple way to avoid problems is to divide inputs by 2 before doing the
multiply/add/subtract operation (DIT), or add/subtract/multiply (DIF). But
you lose precision and dynamic range doing that. An alternative (discussed
in the above exchange) is to have an overflow anticipation circuit and a
divide by 1, 2 or 4 (as noted by the letter writer, a divide by 1 or 2
isn't adequate for the part in question).

I think you need to do a great deal of reading before you begin to
appreciate all the problems. There's a lot of literature about FFT error
analysis for both fixed and floating point (and some for block floating
point). I would suggest refereed journals, where papers are judged by
experts in the field. The FFTW web site also has some error analysis info,
so you might check that out, too.

As for some general info about FFT errors (read the literature for all the
details):

1) Truncation is bad for precision. Rounding to nearest works better.

2) Add/subtract errors usually dominate. Numbers are multiplied by twiddle
factors, so in a 16 bit x 16 bit multiply (32 bit result), errors tend to
grow from the extreme right. But when you add/subtract two 16 bit numbers,
the error can show up as an overflow out of the leftmost digit, or
underflow from the rightmost. Appropriate guard bits and correction
circuitry are often used to address these problems.

3) Algorithms/architectures can make a big difference. More than 30 years
ago, I showed how any radix-2 graph can be modified to have the exact same
number of adds/subtracts/multiplies as a higher power-of-two graph. Thus,
a 64-point radix-2 pipeline FFT needs only one complex multiplier, plus
two special-purpose multipliers to handle the +/-0.707... twiddles (and
multiply by -j is a swap/sign change). A 4096-point FFT can be built with
only three full multipliers and four special-purpose ones, as opposed to
the nine full multipliers in a naive design.

4) People routinely do multi-million point FFTs. The error in an FFT grows
very slowly. I'll leave it to you to search for the 'big Oh' relationship
between error and N.

Kevin McGee
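Kevin's 2.4142... and 2.8284... figures can be verified with a single
butterfly each. A small sketch, with inputs deliberately chosen to hit the
worst case (these particular inputs are my own, not from the thread):

```python
import cmath

def dit_butterfly(a, b, w):
    """Radix-2 decimation-in-time butterfly: multiply, then add/subtract."""
    wb = w * b
    return a + wb, a - wb

def dif_butterfly(a, b, w):
    """Radix-2 decimation-in-frequency butterfly: add/subtract, then multiply."""
    return a + b, (a - b) * w

w = cmath.exp(-1j * cmath.pi / 4)   # the +/-0.707... twiddle

# Inputs whose real and imaginary parts are all bounded by 1.0
top, _ = dit_butterfly(1 + 1j, 1 + 1j, w)
print(top.real)                     # 1 + sqrt(2) = 2.4142..., > 2x the bound

_, bot = dif_butterfly(1 + 1j, -1 - 1j, w)
print(bot.real)                     # 2*sqrt(2) = 2.8284..., the DIF case
```

The magnitude still only doubles per stage; it's the individual real and
imaginary parts that momentarily exceed that, which is exactly why the
"divide by 1 or 2" overflow circuit in the EDN letter wasn't enough.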
>I'm not sure you mean 'scaling', as opposed to 'rounding or truncating',
>'avoiding overflow/underflow', or something else.
>[snip]
>
>Kevin McGee
Hi Kevin McGee,

Thank you very much for your explanation. I appreciate your time.

A bit of explanation: as you mentioned, "add/subtract errors usually
dominate." But if I increase the word length after each add/subtract
operation, this error will be equal to 0. My implementation is in
hardware, and I can of course do that if I don't care about hardware
resources.

The error from the multiplication by the twiddle factor will always exist:
as my implementation is fixed point, I need to truncate the lower part of
the multiplication result. And the only way I have found so far to
decrease this error is rounding.

Thank you very much,
Tiksan.
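Tiksan's last point (rounding to nearest beats truncation after the
twiddle multiply) is easy to quantify. A rough fixed-point sketch; the
Q1.15 word sizes and random test data are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
SHIFT = 15                                  # bits dropped after 16x16 multiply

x = rng.integers(-2**15, 2**15, size=100_000)   # 16-bit signed samples
w = int(round(0.7071067812 * 2**15))            # quantized 0.707... twiddle

prod = x * w                                    # exact 32-bit products

trunc = prod >> SHIFT                           # truncation (arithmetic shift)
rnd = (prod + (1 << (SHIFT - 1))) >> SHIFT      # round to nearest

exact = prod / 2**SHIFT
print(np.abs(trunc - exact).mean())   # about 0.5 LSB, and biased toward -0.5
print(np.abs(rnd - exact).mean())     # about 0.25 LSB, roughly unbiased
```

The bias is the bigger issue in an FFT: truncation errors are all the same
sign, so they accumulate coherently across stages, whereas round-to-nearest
errors tend to cancel.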