comp.dsp | Some questions about radix-4 FFT algorithm using fixed-point arithmetic

Hello all,

I'm very new to signal processing and want to learn to implement some
basic FFT algorithms in hardware. First, I coded up the radix-4 Cooley-
Tukey FFT algorithm in Matlab and it is working fine (compared it
against the Matlab's builtin FFT) when I use floating point number to
represent my signal.

For hardware implementation, the input data is a complex (a + jb). The
real/imaginary are each 8 bits. For fixed point arithmetic, they are
represented as 1 bit for sign, 7 bits for fraction. I'm trying to
implement fixed-point FFT in Matlab but am getting very strange
results. In practice, if the input data is 8 bits each, how many bits
should one use for the internal multiplication/addition? how many bits
for integer part and how many for fraction? Furthermore, how many bits
for the output? It is possible that the output point trasnform could
be greater than [-1,1]. Is there a general rule to determine this?

Here is my Matlab code (in case anyone is curious): http://pastey.net/132596

I'm noticing for fixed-point the transform points getting saturated at
-1,1 but not sure how many bits to use internally and how to divide
them between integer/fraction. Any ideas would be most helpful.

Kind regards.

Reply by pallav ●February 7, 20102010-02-07

Sorry - I should mention that my test signal is just cos(x) + jsin(x)
(euler's formula). So the real/imaginary values are in [-1,1]. So I
just pick 64 random degrees (0-360) and generate these values. The
length of the transform is 64 points.

Kind regards.

Reply by kevin ●February 8, 20102010-02-08

On Feb 7, 1:17&#4294967295;am, pallav <pallavgu...@gmail.com> wrote:
> Sorry - I should mention that my test signal is just cos(x) + jsin(x)
> (euler's formula). So the real/imaginary values are in [-1,1]. So I
> just pick 64 random degrees (0-360) and generate these values. The
> length of the transform is 64 points.
>
> Kind regards.

You are going to have to do a lot of reading.  There are dozens (at
least) papers about fixed and floating point FFT error analysis, and a
number of sections in textbooks devoted to the subject.  I&#4294967295;d dig out a
few references and tell you the titles, but it&#4294967295;s late here, and I
don&#4294967295;t feel like spending the next hour or so going through my
collection.  You could also Google it and get quite a few hits, but
you might get just as much from reading a few of the earlier papers
and book sections.

If I recall, a output of a radix 2 butterfly can achieve a maximum
value of 2.1414... , and that occurs for full scale inputs with one of
the 45 degree twiddles (+/-.707... types).  With radix 4, it&#4294967295;s
something else.  I don&#4294967295;t know off-hand, and I&#4294967295;m not going to look it
up right now.

Be careful viewing something in software and then trying to convert it
to hardware.  You do realize, of course, that MATLAB&#4294967295;s FFTs are seen
by the programmer as sequential in, sequential out.  Internally,
they&#4294967295;re doing bit reversals and a lot else.  In hardware, you have to
be aware of every detail that&#4294967295;s hidden in the software.  And you can
do an awful lot in hardware that you won&#4294967295;t see done in software.

You&#4294967295;re doing a 64 point radix 4 FFT.  Presumably, it&#4294967295;s sequential in,
bit reverse out, after which you&#4294967295;d do the bit reversal.  You&#4294967295;ll have 3
columns of radix 4 butterflies, so you need inter-stage twiddles
between columns 1 and 2, and another between columns 2 and 3.  Your
outputs will be in radix 4 based order.   A transpose will get you
sequential results.

You might consider a 2 stage radix 8 instead.  A radix 8 butterfly
only has the +/- .707... twiddles, so there are fewer non-trivial
twiddles to deal with.  And the +/- .707 twiddles don&#4294967295;t require a full
scale multiplier.  Depending on your word size, you might be able to
do them with as little as one or two full adders.  Plus, you have only
2 columns of radix 8 butterflies, with only one group of inter-stage
twiddles to deal with.

I&#4294967295;d have written a more detailed answer, but it&#4294967295;s late, so I&#4294967295;ll stop
here.  But be prepared to do a lot of reading, and then a lot more
when you start looking into the hundreds of different hardware
architectures you might choose to use.

Kevin McGee

Reply by kevin ●February 8, 20102010-02-08

On Feb 7, 11:28&#4294967295;pm, kevin <kevinjmc...@netscape.net> wrote:
> If I recall, a output of a radix 2 butterfly can achieve a maximum
> value of 2.1414... ,


I meant to write: a maximum of 2.414...

Dang!  Now it's even later than before.

Kevin McGee

Reply by John Monro ●February 8, 20102010-02-08

kevin wrote:
> On Feb 7, 11:28 pm, kevin <kevinjmc...@netscape.net> wrote:
>> If I recall, a output of a radix 2 butterfly can achieve a maximum
>> value of 2.1414... ,
> 
> 
> I meant to write: a maximum of 2.414...
> 
> Dang!  Now it's even later than before.
> 
> Kevin McGee


I say 2.828   Maybe we can average these values :-)

Regards,
John

Reply by pallav ●February 8, 20102010-02-08

On Feb 8, 12:32&#4294967295;am, kevin <kevinjmc...@netscape.net> wrote:
> On Feb 7, 11:28&#4294967295;pm, kevin <kevinjmc...@netscape.net> wrote:
>
> > If I recall, a output of a radix 2 butterfly can achieve a maximum
> > value of 2.1414... ,
>
> I meant to write: a maximum of 2.414...
>
> Dang! &#4294967295;Now it's even later than before.
>
> Kevin McGee

Hi,

Thank you for your response. I will search Google / IEEE Transactions
and try to find some papers on this fixed-point implementation.
Regarding my hardware implementation, for starters, I was just
thinking of a fully-parallel architecture where all the data comes in
parallel, block FFT is computed, and then, all the outputs are
available (also in parallel). Obviously, this will have a huge number
of I/Os but its very simple. Alternatively, the data could come in
sequentially and I just buffer it. After 64 clock cycles, I start the
FFT computation for a number of clock cycles, and then a transpose
stage where one point is available every clock cycle (another 64
cycles). There are probably more efficient ways but I'd like to keep
it simple initially. Again, I'll search a few papers to search for
some simple architectures.

Understanding the radix-4 FFT is not a problem as I seem to have a
good grasp on that. I will have to search in some books for the
maximum value an output can take in a radix-4 FFT as that will
determine the number of bits to use for the integer part.

Just out of curiosity: I notice that Altera/Xilinx provide FFT IP
generators and their interface seems to have one port for data in and
one port for data out. Now say there are a large number of points
(maybe 1024?). So does the computation start only when all the data
inputs have been stored/buffered internally (and thus, the block FFT
computation can start)?

Thank you again for your time/help.

Kind regards.

Reply by kevin ●February 8, 20102010-02-08

John Monro wrote:

> I say 2.828 Maybe we can average these values :-)

Yep, 2.8284... is the maximum for a DIF algorithm butterfly, and
2.4142... is the maximum for a DIT butterfly.  Late last night, after
reading what I posted, I realized that the number in my first post was
wrong and quickly did the calculation using a DIT butterfly.  I forgot
about the DIF case.  That&#4294967295;s why I sometimes look this kind of thing
up, because I remember a letter to the editor of EDN magazine many
years back that mentioned a number of hazards when doing FFT
arithmetic.  I actually found my copy of it today! (naturally, I found
it in my folder for EDN, and not in the one for &#4294967295;FFT error analysis&#4294967295;).

&#4294967295;FFT algorithm contains overflow hazard&#4294967295; by Dr. J. W. Locke, plus
response by J. Niehaus, EDN magazine, Feb. 23, 1984, pps. 29, 30 and
34.

The above contains some further items about maximum bit growth and
scaling issues in stages of an FFT.

John, it turns out that we were both right!  Thanks for pointing out
the DIF case.

As for pallav, here are a few of the papers about error analysis you
might try to obtain:

Tran-Thong and Bede Liu, &#4294967295;Fixed-Point Fast Fourier Transform Error
Analysis,&#4294967295;  IEEE Trans. on Acoustics, Speech and Signal Proc., vol.
ASSP-24, no. 6, Dec. 1976, pps. 563-573.

E. O. Brigham and L. R. Cecchini, &#4294967295;A Nomogram for Determining FF
System Dynamic Range,&#4294967295; IEEE Int&#4294967295;l Conf. on Acoustics, Speech and
Signal Proc., pps. 623-627.

P. D. Welch, &#4294967295;A Fixed-Point Fast Fourier Transform Error Analysis,&#4294967295;
IEEE Trans. on Audio and Electroacoustics, vol. AU-17, no. 2, June
1969, pps. 151-157.

And a book:

L. R. Rabiner and B. Gold, Theory and Application of Digital Signal
Processing, Sect. 10.5, pps. 587-594, Introduction to Quantization
Effects in FFT Algorithms, Prentice-Hall, 1975.

There are dozens more, but I don&#4294967295;t want to spend too much time typing
them in because there are other things to add (and I&#4294967295;m a slow
typist).  As I mentioned yesterday, the butterflies in a conventional
FFT are sequential in/sequential out.  So if you&#4294967295;re simulating an FFT
butterfly in MATLAB, you have to be aware of the fact that your result
has been bit reversed somewhere to get all the outputs in sequence.
If you&#4294967295;re trying to do things in hardware, you&#4294967295;re going to have to
mimic that process.  But most FFT architectures do bit reversal at the
front or back end of the algorithm (and some that do it internally
because of the way buffers feed adders/subtractors and multipliers).
How you implement something in hardware or software can be very
different.

> Just out of curiosity: I notice that Altera/Xilinx provide FFT IP
> generators and their interface seems to have one port for data in and
> one port for data out. Now say there are a large number of points
> (maybe 1024?). So does the computation start only when all the data
> inputs have been stored/buffered internally (and thus, the block FFT
> computation can start)?

FFT architectures are made up of memory and arithmetic units (AU&#4294967295;s)
plus control.  They can be broadly categorized into: single AU, single
column AU&#4294967295;s (does one stage at a time), pipelines (one AU per column)
and systolic.  The pipelines are particularly good at processing the
most data in the least amount of time with a reasonable amount of
hardware resources.  Basically, they overlap data transfer with
calculation.  Note that for a sequential-in radix-2 graph, you can
start calculating the butterflies of the first column after receiving
the first N/2+1 input points.  Outputs are fed immediately to a buffer
for the second stage. When that buffer gets enough data, the second
stage AU starts calculating, etc.

I can&#4294967295;t tell you specifically how the Altera/Xilinx implement their
FFT cores without looking at the specs. I just scanned Altera&#4294967295;s site,
and one of the figures shows the even/odd graph that is typical of a
bit reversed in/sequential out algorithm.  So they probably load all
the data into a buffer first.  The bit reversal at the front end can
easily be accomplished by proper loading/unloading of the buffer.  But
I don&#4294967295;t know if the data is processed internally by a single AU or
multiple AU&#4294967295;s per column, or pipeline.  I&#4294967295;d have to look at it
further.

The Rabiner and Gold book is a good place to start for pipeline
architectures.  You might also refer to any one of the hundreds of
papers and patents available to learn about the many variations of FFT
processors.  For example, you might want to read section 2
(description of the prior art) in the PDF file of the following
patent:

http://www.freepatentsonline.com/4534009.pdf

(Hmmm. I think I know that guy).  The architecture is the only one I
know of that achieves 100 per cent AU efficiency with the minimum
amount of memory.  And an 8 point version of it would make for a good
8 point butterfly processor.

But maybe what you want to do requires something different.  You&#4294967295;ll
have to explore a lot of the prior art to get up to speed on it (and
there&#4294967295;s quite a bit of published material).

Kevin McGee

Some questions about radix-4 FFT algorithm using fixed-point arithmetic

Sign in

You might also like...

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About DSPRelated.com

Social Networks

The Related Media Group