DSPRelated.com
Forums

Some questions about radix-4 FFT algorithm using fixed-point arithmetic

Started by pallav February 7, 2010
Hello all,

I'm very new to signal processing and want to learn to implement some
basic FFT algorithms in hardware. First, I coded up the radix-4 Cooley-
Tukey FFT algorithm in Matlab and it is working fine (compared it
against the Matlab's builtin FFT) when I use floating point number to
represent my signal.

For hardware implementation, the input data is a complex (a + jb). The
real/imaginary are each 8 bits. For fixed point arithmetic, they are
represented as 1 bit for sign, 7 bits for fraction. I'm trying to
implement fixed-point FFT in Matlab but am getting very strange
results. In practice, if the input data is 8 bits each, how many bits
should one use for the internal multiplication/addition? how many bits
for integer part and how many for fraction? Furthermore, how many bits
for the output? It is possible that the output point trasnform could
be greater than [-1,1]. Is there a general rule to determine this?

Here is my Matlab code (in case anyone is curious): http://pastey.net/132596

I'm noticing for fixed-point the transform points getting saturated at
-1,1 but not sure how many bits to use internally and how to divide
them between integer/fraction. Any ideas would be most helpful.

Kind regards.
Sorry - I should mention that my test signal is just cos(x) + jsin(x)
(euler's formula). So the real/imaginary values are in [-1,1]. So I
just pick 64 random degrees (0-360) and generate these values. The
length of the transform is 64 points.

Kind regards.
On Feb 7, 1:17&#4294967295;am, pallav <pallavgu...@gmail.com> wrote:
> Sorry - I should mention that my test signal is just cos(x) + jsin(x) > (euler's formula). So the real/imaginary values are in [-1,1]. So I > just pick 64 random degrees (0-360) and generate these values. The > length of the transform is 64 points. > > Kind regards.
You are going to have to do a lot of reading. There are dozens (at least) papers about fixed and floating point FFT error analysis, and a number of sections in textbooks devoted to the subject. I&#4294967295;d dig out a few references and tell you the titles, but it&#4294967295;s late here, and I don&#4294967295;t feel like spending the next hour or so going through my collection. You could also Google it and get quite a few hits, but you might get just as much from reading a few of the earlier papers and book sections. If I recall, a output of a radix 2 butterfly can achieve a maximum value of 2.1414... , and that occurs for full scale inputs with one of the 45 degree twiddles (+/-.707... types). With radix 4, it&#4294967295;s something else. I don&#4294967295;t know off-hand, and I&#4294967295;m not going to look it up right now. Be careful viewing something in software and then trying to convert it to hardware. You do realize, of course, that MATLAB&#4294967295;s FFTs are seen by the programmer as sequential in, sequential out. Internally, they&#4294967295;re doing bit reversals and a lot else. In hardware, you have to be aware of every detail that&#4294967295;s hidden in the software. And you can do an awful lot in hardware that you won&#4294967295;t see done in software. You&#4294967295;re doing a 64 point radix 4 FFT. Presumably, it&#4294967295;s sequential in, bit reverse out, after which you&#4294967295;d do the bit reversal. You&#4294967295;ll have 3 columns of radix 4 butterflies, so you need inter-stage twiddles between columns 1 and 2, and another between columns 2 and 3. Your outputs will be in radix 4 based order. A transpose will get you sequential results. You might consider a 2 stage radix 8 instead. A radix 8 butterfly only has the +/- .707... twiddles, so there are fewer non-trivial twiddles to deal with. And the +/- .707 twiddles don&#4294967295;t require a full scale multiplier. Depending on your word size, you might be able to do them with as little as one or two full adders. Plus, you have only 2 columns of radix 8 butterflies, with only one group of inter-stage twiddles to deal with. I&#4294967295;d have written a more detailed answer, but it&#4294967295;s late, so I&#4294967295;ll stop here. But be prepared to do a lot of reading, and then a lot more when you start looking into the hundreds of different hardware architectures you might choose to use. Kevin McGee
On Feb 7, 11:28&#4294967295;pm, kevin <kevinjmc...@netscape.net> wrote:
> If I recall, a output of a radix 2 butterfly can achieve a maximum > value of 2.1414... ,
I meant to write: a maximum of 2.414... Dang! Now it's even later than before. Kevin McGee
kevin wrote:
> On Feb 7, 11:28 pm, kevin <kevinjmc...@netscape.net> wrote: >> If I recall, a output of a radix 2 butterfly can achieve a maximum >> value of 2.1414... , > > > I meant to write: a maximum of 2.414... > > Dang! Now it's even later than before. > > Kevin McGee
I say 2.828 Maybe we can average these values :-) Regards, John
On Feb 8, 12:32&#4294967295;am, kevin <kevinjmc...@netscape.net> wrote:
> On Feb 7, 11:28&#4294967295;pm, kevin <kevinjmc...@netscape.net> wrote: > > > If I recall, a output of a radix 2 butterfly can achieve a maximum > > value of 2.1414... , > > I meant to write: a maximum of 2.414... > > Dang! &#4294967295;Now it's even later than before. > > Kevin McGee
Hi, Thank you for your response. I will search Google / IEEE Transactions and try to find some papers on this fixed-point implementation. Regarding my hardware implementation, for starters, I was just thinking of a fully-parallel architecture where all the data comes in parallel, block FFT is computed, and then, all the outputs are available (also in parallel). Obviously, this will have a huge number of I/Os but its very simple. Alternatively, the data could come in sequentially and I just buffer it. After 64 clock cycles, I start the FFT computation for a number of clock cycles, and then a transpose stage where one point is available every clock cycle (another 64 cycles). There are probably more efficient ways but I'd like to keep it simple initially. Again, I'll search a few papers to search for some simple architectures. Understanding the radix-4 FFT is not a problem as I seem to have a good grasp on that. I will have to search in some books for the maximum value an output can take in a radix-4 FFT as that will determine the number of bits to use for the integer part. Just out of curiosity: I notice that Altera/Xilinx provide FFT IP generators and their interface seems to have one port for data in and one port for data out. Now say there are a large number of points (maybe 1024?). So does the computation start only when all the data inputs have been stored/buffered internally (and thus, the block FFT computation can start)? Thank you again for your time/help. Kind regards.
John Monro wrote:

> I say 2.828 Maybe we can average these values :-)
Yep, 2.8284... is the maximum for a DIF algorithm butterfly, and 2.4142... is the maximum for a DIT butterfly. Late last night, after reading what I posted, I realized that the number in my first post was wrong and quickly did the calculation using a DIT butterfly. I forgot about the DIF case. That&#4294967295;s why I sometimes look this kind of thing up, because I remember a letter to the editor of EDN magazine many years back that mentioned a number of hazards when doing FFT arithmetic. I actually found my copy of it today! (naturally, I found it in my folder for EDN, and not in the one for &#4294967295;FFT error analysis&#4294967295;). &#4294967295;FFT algorithm contains overflow hazard&#4294967295; by Dr. J. W. Locke, plus response by J. Niehaus, EDN magazine, Feb. 23, 1984, pps. 29, 30 and 34. The above contains some further items about maximum bit growth and scaling issues in stages of an FFT. John, it turns out that we were both right! Thanks for pointing out the DIF case. As for pallav, here are a few of the papers about error analysis you might try to obtain: Tran-Thong and Bede Liu, &#4294967295;Fixed-Point Fast Fourier Transform Error Analysis,&#4294967295; IEEE Trans. on Acoustics, Speech and Signal Proc., vol. ASSP-24, no. 6, Dec. 1976, pps. 563-573. E. O. Brigham and L. R. Cecchini, &#4294967295;A Nomogram for Determining FF System Dynamic Range,&#4294967295; IEEE Int&#4294967295;l Conf. on Acoustics, Speech and Signal Proc., pps. 623-627. P. D. Welch, &#4294967295;A Fixed-Point Fast Fourier Transform Error Analysis,&#4294967295; IEEE Trans. on Audio and Electroacoustics, vol. AU-17, no. 2, June 1969, pps. 151-157. And a book: L. R. Rabiner and B. Gold, Theory and Application of Digital Signal Processing, Sect. 10.5, pps. 587-594, Introduction to Quantization Effects in FFT Algorithms, Prentice-Hall, 1975. There are dozens more, but I don&#4294967295;t want to spend too much time typing them in because there are other things to add (and I&#4294967295;m a slow typist). As I mentioned yesterday, the butterflies in a conventional FFT are sequential in/sequential out. So if you&#4294967295;re simulating an FFT butterfly in MATLAB, you have to be aware of the fact that your result has been bit reversed somewhere to get all the outputs in sequence. If you&#4294967295;re trying to do things in hardware, you&#4294967295;re going to have to mimic that process. But most FFT architectures do bit reversal at the front or back end of the algorithm (and some that do it internally because of the way buffers feed adders/subtractors and multipliers). How you implement something in hardware or software can be very different.
> Just out of curiosity: I notice that Altera/Xilinx provide FFT IP > generators and their interface seems to have one port for data in and > one port for data out. Now say there are a large number of points > (maybe 1024?). So does the computation start only when all the data > inputs have been stored/buffered internally (and thus, the block FFT > computation can start)?
FFT architectures are made up of memory and arithmetic units (AU&#4294967295;s) plus control. They can be broadly categorized into: single AU, single column AU&#4294967295;s (does one stage at a time), pipelines (one AU per column) and systolic. The pipelines are particularly good at processing the most data in the least amount of time with a reasonable amount of hardware resources. Basically, they overlap data transfer with calculation. Note that for a sequential-in radix-2 graph, you can start calculating the butterflies of the first column after receiving the first N/2+1 input points. Outputs are fed immediately to a buffer for the second stage. When that buffer gets enough data, the second stage AU starts calculating, etc. I can&#4294967295;t tell you specifically how the Altera/Xilinx implement their FFT cores without looking at the specs. I just scanned Altera&#4294967295;s site, and one of the figures shows the even/odd graph that is typical of a bit reversed in/sequential out algorithm. So they probably load all the data into a buffer first. The bit reversal at the front end can easily be accomplished by proper loading/unloading of the buffer. But I don&#4294967295;t know if the data is processed internally by a single AU or multiple AU&#4294967295;s per column, or pipeline. I&#4294967295;d have to look at it further. The Rabiner and Gold book is a good place to start for pipeline architectures. You might also refer to any one of the hundreds of papers and patents available to learn about the many variations of FFT processors. For example, you might want to read section 2 (description of the prior art) in the PDF file of the following patent: http://www.freepatentsonline.com/4534009.pdf (Hmmm. I think I know that guy). The architecture is the only one I know of that achieves 100 per cent AU efficiency with the minimum amount of memory. And an 8 point version of it would make for a good 8 point butterfly processor. But maybe what you want to do requires something different. You&#4294967295;ll have to explore a lot of the prior art to get up to speed on it (and there&#4294967295;s quite a bit of published material). Kevin McGee