comp.dsp | 64 x 64 FFT algorithm for 4096 pt. FFT| page 2

Reply by onkars ●July 1, 20102010-07-01

>On 6/29/2010 6:22 PM, onkars wrote:
>> I am looking for an FFT algorithm that can give me a 4096 pt. FFT using
t=
>wo
>> 64 pt. FFTs. Is it the mixed radix FFT?
>... plus second post of 7:28 PM
>
>4096=3D64x64 is not mixed radix.  It's 2 stages of radix 64 butterflies,
>so it's radix 64.  Mixed radix would be 4096 =3D 128x32, or 32x128, or
>256x16, or 16x256, or ... (it's a long list).
>
>Based on some of your previous posts, I presume that you're looking to
>do this in hardware.  In that case, a 64x64 isn't a bad choice (there
>are better ones), because you can break the 64x64 down into a 4 stage
>pipeline of 8x8x8x8.  Radix 8 is quite efficient in that you only have
>a couple of 'difficult' internal twiddles of the +/-.707.... type, and
>they can be done efficiently with a special purpose multiplier.  So
>you'd have 4 radix 8 butterfly stages (each with a special purpose +/-.
>707... multiplier), plus 3 full scale multipliers for the inter-stage
>twiddles.
>
>There are a lot of additional concepts that could be applied, but I'm
>not really sure exactly what you're trying to do, so I won't get into
>them.
>
>To get an idea of the basic concepts, you may want to look at:
>
>L. R. Rabiner, B. Gold, Theory and Application of Digital Signal
>Processing. Englewood Cliffs, NJ, Prentice-Hall, 1975.
>
>Chapters 6 and 10 in the above deal with FFT and some basic pipeline
>architectures.  Although the book is out of print, you may be able to
>find it at a local university library, or on EBAY.
>
>As for the twiddles between the stages, they're fairly simple to
>figure out.  I could tell you what they are for a 64x64 graph, or an
>8x8x8x8, but then I'd be doing all the work, and you wouldn't really
>learn anything (here's a hint: note the pattern of the input/output
>order and twiddle sequence of each butterfly for the 8x8 graph in Fig.
>10.15 on p. 587 in Rabiner and Gold's book.  Now imagine what the
>graph would look like if you had 2 stages of 64x64 butterflies).
>
>If you get through all of the above and feel a little ambitious, you
>might want to take a look at:
>
>K. J. McGee, =93Comments on =91A 64-Point Fourier Transform Chip for
Video
>Motion Compensation Using Phase Correlation,=94 IEEE J. Solid-State
>Circuits, vol. 33, no. 6, June 1998, pp. 928-932.
>
>In particular, pay attention to Section III in the above:  Equivalence
>of Higher and Lower Radix Graphs.  It has important implications for
>shuffle exchange algorithms and pipeline architectures (and, as noted,
>I originally developed the concept in 1981).
>
>Kevin McGee
>

Hi Kevin,

Thanks for your detailed response.

Actually I am working on building a 4096 pt. FFT processor (8 bit input
with scaling) which is aimed at athroughtput of 20+ GigaSamples/s. 

I am targeting this design for the Virtex 6 FPGA. With some initial
resource estimation and experiments with the Xilinx core FFT, I realized
that the issue with an FPGA was not so much multipliers and memory -- it is
the number of slices that are the bottleneck.

For Virtex 6 I can use the DSP48s for complex multiplication (4 per complex
mult) -- Virtex 6 has 2000 of these. And it also has abundant Block RAMs.
So I have to efficiently use the slices -- saving adders, registers could
help this.

I figured out that I could realistically achieve 350 MHz on the FPGA --
which implies that to achieve 20 Gsamples/s I need approx. 64 samples
output per clock cycle @350 MHz.

So my goals sort of become:
1. Minimize slice usage -- even at the cost of extra memory usage or
slightly more mults (implies reducing adders and registers).
2. Reduce long routes/wires which could bring the max clock freq. below 350
MHz.
3. one thing is for sure -- use MDC structures -- using SDF and SDC
structures saves memory, which is not our goal.

And the options that I consider:
1. Use 64x64 structure -- with reordering memory (RAMs) in between the 2 64
stages. Also, here I intend to use the full-blown parallel-pipelined (i.e.
full row and full column parallelism) implementation of 64 stage
(internally using maybe radix 4 i.e. 2-squared implemenation). The
advantages of this arch is that control will be minimal and simple. Also,
single structure computes the FFT @350 MHz hence the inputs will always
arrive on the same pins -- which feed this FFT. The downside is that the
routes will be long -- and may implact the ability to achieve 350 MHz.

2. Another approach is using X number FFTs (unable to figure out X and fft
length) in parallel (each implemented as radix 4 MDC). And these X parallel
FFTs feeding the 64 inputs of single 64 pt. FFT implemented full blown
(full parallelism) -- which eventually gives 4096 pt. FFT.

3. Use 8x8x8x8 MDC structure FFT. Will need to use 8 of these to get 64
outputs per cycle(with input demux feeding the 8 FFTs). The radix 8 stages
will be implemented using 2^3 (or 2-cube radix). Good point is shorter
routes, but more commutators -- making it complex. Another issue is the
input buffering will have to be managed to achieve 100% utilization. Also,
there will have to be some kind of demux. between the input pins and 8
different FFTs OR use 8 different input pin sets and external demux.

4. similar to (2) but use 4x4x4x4x4x4 MDC. Use 16 of these. Advantage is
even smaller routes. Disadvantages -- more complex commutators and control.
Input pins to 16 FFT routing. 

Do you have any suggestions? Any comments or any other different
architecture or some pointers/references I could use?

Thank you.

Onkar

Reply by kevin ●July 2, 20102010-07-02

On Jul 1, 5:54=A0pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:

(snip)

Some general comments first:

There are a great many pipeline FFT architectures (>100, at least).
You should thoroughly review the patent literature, journals and
conference proceedings.

Not all pipelines are created equal.  Back in the 1980's, I did a
comparative analysis of many architectures.  Some use less memory than
others.  Some allow for much higher input/output data speeds (i.e.:
given a certain technology speed, what's the highest clock rate you
can use?).  In some, the arithmetic units operate only 50% or 25% of
the time.  For one of the architectures, I had to view the original
patent, rather than the published articles, because I determined that
it couldn't possibly work unless you had some extra registers at the
input/output of the arithmetic units (sure enough, they're right there
in the patent, but none of the published articles show them).

You might want to take a look at Fig. 3 in the previously mentioned
1998 journal article of mine.  It shows a 3 stage 64 point 4x4x4.
It's 100% efficient in that all arithmetic units operate with no dead
time.  And one nice thing about it is that the outputs are in line
sequential order (see the figure in the paper for the sequence).

You could either use that as your first 64 point, dump the outputs
into a buffer, and then use the another 4x4x4 for the second 64
point.  Or you could design a 16x16x16.  Your arithmetic unit in that
case will have to handle the +/- .707... twiddles, plus the couple of
others that you get for a 16 point FFT.

As for some of your specific comments:

>> Reduce long routes/wires which could bring the max clock freq. below 350=
 MHz.

A persistent problem.  In the late 1980's, I took a course at LSI
Logic for designing with their gate array.  It was an R+D project, so
we didn't have the money in the budget to pay for a mask charge
(100K), but I was able to use their design tools to simulate a 64
point pipeline that I had designed.  It took a lot of clever ideas to
get it to operate at a 3 nanosecond clock (folding shift registers
back on themselves so that data went in and out of them at spots near
the arithmetic units; fully pipelining both the adders and multipliers
so I could feed them at the maximum clock speed; using the pipelining
delay through the adders/multiplier such that they effectively doubled
as additional storage) .  Although the pipeline could handle it, the I/
O pins would only allow for 5 nanosecond speeds.

You're using an FPGA, so your problems could actually be worse in that
you probably don't have access to the lower level design elements that
I had.

>> Use 64x64 structure -- with reordering memory (RAMs) in between the 2 64
>> stages. Also, here I intend to use the full-blown parallel-pipelined (i.=
e.
>> full row and full column parallelism)

The only time that I would consider using an array architecture like
that is if there were no other way to meet the speed requirement.

>> but more commutators

In custom NMOS and CMOS, they're not too difficult.  Take a look at
the aforementioned 1998 paper and check out reference 4 (US patent
4396994):

http://www.freepatentsonline.com/4396994.html

The commutator operation can be seen and implemented as a barrel
shifter (see Fig. 3 in the above paper).  However, I'm not sure that
you'll have access to the low level design elements that you'd need
when using an FPGA.  You may be forced into using mux or demux.

>> similar to (2) but use 4x4x4x4x4x4 MDC. Use 16 of these. Advantage is
>> even smaller routes. Disadvantages -- more complex commutators and
>> control. Input pins to 16 FFT routing.

Alright.  I'd have to think about it.  I'll look at your post again
tonight.

Kevin McGee

Reply by kevin ●July 3, 20102010-07-03

On Jul 1, 5:54 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:

>I figured out that I could realistically achieve 350 MHz on the FPGA --
>which implies that to achieve 20 Gsamples/s I need approx. 64 samples
>output per clock cycle @350 MHz.

OK.  I get it now why you are probably going to have to use an array
type arrangement to do the 64 point.

But I'm not clear exactly how your data is coming into the VIRTEX
slices - are you using the Ghz serial port?  And how is the input data
partitioned and/or sequenced?  Sequential in, I suppose.

You mentioned the need for 100% utilization for the input buffering.
Quite possibly you may find the following to be helpful.  First,
things like bit reversal or matrix transpose are fairly easy out of a
RAM.  For instance, in several bit reverse in/sequential out
architectures, the data is first bit reversed by reading/writing out
of RAM.  Consider the simple case of an 8 point FFT.  Data points 0 to
7 are loaded into sequential RAM locations, with the addressing
controlled by a counter.  Now the counter has a 2 to 1 mux on every
bit output.  Things are arranged such that the counter output is
either in sequential mode (i.e.: count from 0 to 7), or in bit
reversed mode (count in bit reversed order).  When the second set of 8
data points arrives, you do a read/write of the location.  The data
being read out will be the bit reversed sequence of your first 8
points.  The second set of data will actually be stored in the bit
reversed locations.  When the third set of 8 points arrives, you read/
write again with the counter in sequential mode.  The second set of
data will also come out in bit reversed order.  All you need do is
start out correctly and switch the counter mode every 8 data samples.
It actually works - I've done things that way for a very long time.
Try it using the 8 point example, it's really not at all difficult to
understand.

Here's another thing about the above point:  take a look at the shift
cell / commutator arrangement that you see in a great many pipeline
architectures.  They're all bit reversers or matrix transposers. If
you have a fast enough RAM that can tolerate the input speeds, you
might consider doing things that way.  All you have to do is set up
the counter properly.  Advantage: no more commutators.

And just another thought: presuming your input data is real only, have
you considered using the '2N real FFT with an N complex FFT'
technique.  That would cut your FFT size down to 2048.  There's a
little bit of extra processing involved, but it's roughly equivalent
to adding a column of butterflies at the end.

I really haven't had enough time to look at your post thoroughly.
I'll spend some time over the weekend on it.  I also spent a little
time scanning the VIRTEX specs.

I could suggest other things or different architectures (or make
better comments on your suggestions), but I'd need to have a clearer
idea of just exactly how that input data is coming in.

Kevin McGee

Reply by kevin ●July 3, 20102010-07-03

On Jul 3, 1:38=A0am, kevin <kevinjmc...@netscape.net> wrote:
> On Jul 1, 5:54 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
> wrote:
>
> >I figured out that I could realistically achieve 350 MHz on the FPGA --
> >which implies that to achieve 20 Gsamples/s I need approx. 64 samples
> >output per clock cycle @350 MHz.
>
> OK. =A0I get it now why you are probably going to have to use an array
> type arrangement to do the 64 point.
>
> But I'm not clear exactly how your data is coming into the VIRTEX
> slices - are you using the Ghz serial port? =A0And how is the input data
> partitioned and/or sequenced? =A0Sequential in, I suppose.
>
> You mentioned the need for 100% utilization for the input buffering.
> Quite possibly you may find the following to be helpful. =A0First,
> things like bit reversal or matrix transpose are fairly easy out of a
> RAM. =A0For instance, in several bit reverse in/sequential out
> architectures, the data is first bit reversed by reading/writing out
> of RAM. =A0Consider the simple case of an 8 point FFT. =A0Data points 0 t=
o
> 7 are loaded into sequential RAM locations, with the addressing
> controlled by a counter. =A0Now the counter has a 2 to 1 mux on every
> bit output. =A0Things are arranged such that the counter output is
> either in sequential mode (i.e.: count from 0 to 7), or in bit
> reversed mode (count in bit reversed order). =A0When the second set of 8
> data points arrives, you do a read/write of the location. =A0The data
> being read out will be the bit reversed sequence of your first 8
> points. =A0The second set of data will actually be stored in the bit
> reversed locations. =A0When the third set of 8 points arrives, you read/
> write again with the counter in sequential mode. =A0The second set of
> data will also come out in bit reversed order. =A0All you need do is
> start out correctly and switch the counter mode every 8 data samples.
> It actually works - I've done things that way for a very long time.
> Try it using the 8 point example, it's really not at all difficult to
> understand.
>
> Here's another thing about the above point: =A0take a look at the shift
> cell / commutator arrangement that you see in a great many pipeline
> architectures. =A0They're all bit reversers or matrix transposers. If
> you have a fast enough RAM that can tolerate the input speeds, you
> might consider doing things that way. =A0All you have to do is set up
> the counter properly. =A0Advantage: no more commutators.
>
> And just another thought: presuming your input data is real only, have
> you considered using the '2N real FFT with an N complex FFT'
> technique. =A0That would cut your FFT size down to 2048. =A0There's a
> little bit of extra processing involved, but it's roughly equivalent
> to adding a column of butterflies at the end.
>
> I really haven't had enough time to look at your post thoroughly.
> I'll spend some time over the weekend on it. =A0I also spent a little
> time scanning the VIRTEX specs.
>
> I could suggest other things or different architectures (or make
> better comments on your suggestions), but I'd need to have a clearer
> idea of just exactly how that input data is coming in.
>
> Kevin McGee

Meant to say: the counter outputs go to a 2 to 1 mux such that you can
select either sequential output or bit reversed output.

Reply by onkars ●July 4, 20102010-07-04

>On Jul 3, 1:38=A0am, kevin <kevinjmc...@netscape.net> wrote:
>> On Jul 1, 5:54 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
>> wrote:
>>
>> >I figured out that I could realistically achieve 350 MHz on the FPGA
--
>> >which implies that to achieve 20 Gsamples/s I need approx. 64 samples
>> >output per clock cycle @350 MHz.
>>
>> OK. =A0I get it now why you are probably going to have to use an array
>> type arrangement to do the 64 point.
>>
>> But I'm not clear exactly how your data is coming into the VIRTEX
>> slices - are you using the Ghz serial port? =A0And how is the input
data
>> partitioned and/or sequenced? =A0Sequential in, I suppose.
>>
>> You mentioned the need for 100% utilization for the input buffering.
>> Quite possibly you may find the following to be helpful. =A0First,
>> things like bit reversal or matrix transpose are fairly easy out of a
>> RAM. =A0For instance, in several bit reverse in/sequential out
>> architectures, the data is first bit reversed by reading/writing out
>> of RAM. =A0Consider the simple case of an 8 point FFT. =A0Data points 0
t=
>o
>> 7 are loaded into sequential RAM locations, with the addressing
>> controlled by a counter. =A0Now the counter has a 2 to 1 mux on every
>> bit output. =A0Things are arranged such that the counter output is
>> either in sequential mode (i.e.: count from 0 to 7), or in bit
>> reversed mode (count in bit reversed order). =A0When the second set of
8
>> data points arrives, you do a read/write of the location. =A0The data
>> being read out will be the bit reversed sequence of your first 8
>> points. =A0The second set of data will actually be stored in the bit
>> reversed locations. =A0When the third set of 8 points arrives, you
read/
>> write again with the counter in sequential mode. =A0The second set of
>> data will also come out in bit reversed order. =A0All you need do is
>> start out correctly and switch the counter mode every 8 data samples.
>> It actually works - I've done things that way for a very long time.
>> Try it using the 8 point example, it's really not at all difficult to
>> understand.
>>
>> Here's another thing about the above point: =A0take a look at the shift
>> cell / commutator arrangement that you see in a great many pipeline
>> architectures. =A0They're all bit reversers or matrix transposers. If
>> you have a fast enough RAM that can tolerate the input speeds, you
>> might consider doing things that way. =A0All you have to do is set up
>> the counter properly. =A0Advantage: no more commutators.
>>
>> And just another thought: presuming your input data is real only, have
>> you considered using the '2N real FFT with an N complex FFT'
>> technique. =A0That would cut your FFT size down to 2048. =A0There's a
>> little bit of extra processing involved, but it's roughly equivalent
>> to adding a column of butterflies at the end.
>>
>> I really haven't had enough time to look at your post thoroughly.
>> I'll spend some time over the weekend on it. =A0I also spent a little
>> time scanning the VIRTEX specs.
>>
>> I could suggest other things or different architectures (or make
>> better comments on your suggestions), but I'd need to have a clearer
>> idea of just exactly how that input data is coming in.
>>
>> Kevin McGee
>
>Meant to say: the counter outputs go to a 2 to 1 mux such that you can
>select either sequential output or bit reversed output.
>


Thank you. I really appreciate your time.

As for the inputs to the FPGA -- thats an open aspect too. It can be
designed to best suite the FFT implementation. But we will be using the
HighSpeed serial IOs. So it pretty much depends on the how the FFT is
implemented -- we could even use external fast demuxes if necessary. 

Also, the input is complex.

As mentioned earlier I should have 64 samples output per clock cycle. As a
reminder I am doing 4096 pt. fft.

So this can be done using various options e.g. 

one FFT made of 64-path radix-64 using 2 stages (full row and column
parallel implementation of radix-64)  

OR 

4 FFTs made of 16-path radix-16 using 3 stages (full row and column
parallel implementation of radix-16) 

OR

8 FFTs made of 8-path radix-8 (full row and column parallel implementation
of radix-8) using 4 stages


Using smaller radices has following disadvantages ---
The number of stages and required number of FFTs (to achieve 64
samples/clock) increases. Reasonably assuming that the number arithmetic
units for any approach is same --- this means that the control overhead
(scheduling logic + commutators using logic not mem. etc.) is increased. I
am not sure how much percentage this is.

Advantage of smaller radices is that the routes will be shorter (???) and
higher frequencies can be achievable. Will the routes when using a single
64-path radix-64 solution be longer than when using 4 16-path radix-16 ffts
??

Somehow I am inclined towards going for the 16 path radix 16 solution, and
aim to achieve 250 to 300 MHz. I have seen a paper (credible) claim they
achieved 40 MHz for a fully parallel (row and column) 256 pt. fft using
1024 radix 2 butterflies on Virtex 5.

Reply by kevin ●July 4, 20102010-07-04

On Jul 3, 11:09 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:

>As for the inputs to the FPGA -- thats an open aspect too. It can be
>designed to best suite the FFT implementation. But we will be using the
>HighSpeed serial IOs. So it pretty much depends on the how the FFT is
>implemented -- we could even use external fast demuxes if necessary.

>Also, the input is complex.

>As mentioned earlier I should have 64 samples output per clock cycle. As a
>reminder I am doing 4096 pt. fft.

Very useful information.

I'm a little concerned about the rest of your post because I'm
presuming that your input data is sequential.  That is, 8 bits each of
points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then
128 ... 191, then 192 ... 255, all the way up to 4095.  So you have 64
eight bit sequential inputs per clock tick at 375 Mhz (512 bits
total).

Here's the problem.  With a 4096 point FFT, you've got 2 stages of 64
point butterflies with 1 column of inter-stage twiddles between them.
No matter what the input/output order, geometry type or decimation,
the top most butterfly in the first stage uses the following inputs:
0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is
4096-64 = 4032).  The next butterfly down from the top in the first
stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033).
The next butterfly down uses  a PLUS 2 (2, 66, 130, 194, ... 4034).
Once again, samples separated by 64.

If you need any clarification, I can send a 64 point 8x8 bitmap
(or .pdf or whatever) via 'reply to author'.  It's not at all
difficult to explain what a 64x64 graph looks like after viewing the
8x8 (and I can send the explanation, too).

Now you can break down the a 4096 many different ways, but you'll
still have the same problem of how to feed the first stage of the
pipeline with the appropriate grouping of inputs.  I was concerned
because you post kind of implied that you've already got your 4096
data coming in as if it were in bit-reversed order.  Bit-reversing
your 4096 inputs when they're coming in 64 at a time is not trivial
and requires some careful thought.  A double buffer would work, but
it's extra hardware.

Not to worry.  I'm sure it can be done, but the devil's in the
details.

I've also got a couple of other suggestions, but I'll post them later,
and think about your buffering problem in the meantime.

You've got an interesting problem.

Kevin McGee

Reply by onkars ●July 5, 20102010-07-05

>
>On Jul 3, 11:09 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
>wrote:
>
>>As for the inputs to the FPGA -- thats an open aspect too. It can be
>>designed to best suite the FFT implementation. But we will be using the
>>HighSpeed serial IOs. So it pretty much depends on the how the FFT is
>>implemented -- we could even use external fast demuxes if necessary.
>
>>Also, the input is complex.
>
>>As mentioned earlier I should have 64 samples output per clock cycle. As
a
>>reminder I am doing 4096 pt. fft.
>
>Very useful information.
>
>I'm a little concerned about the rest of your post because I'm
>presuming that your input data is sequential.  That is, 8 bits each of
>points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then
>128 ... 191, then 192 ... 255, all the way up to 4095.  So you have 64
>eight bit sequential inputs per clock tick at 375 Mhz (512 bits
>total).
>
>Here's the problem.  With a 4096 point FFT, you've got 2 stages of 64
>point butterflies with 1 column of inter-stage twiddles between them.
>No matter what the input/output order, geometry type or decimation,
>the top most butterfly in the first stage uses the following inputs:
>0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is
>4096-64 = 4032).  The next butterfly down from the top in the first
>stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033).
>The next butterfly down uses  a PLUS 2 (2, 66, 130, 194, ... 4034).
>Once again, samples separated by 64.
>
>If you need any clarification, I can send a 64 point 8x8 bitmap
>(or .pdf or whatever) via 'reply to author'.  It's not at all
>difficult to explain what a 64x64 graph looks like after viewing the
>8x8 (and I can send the explanation, too).
>
>Now you can break down the a 4096 many different ways, but you'll
>still have the same problem of how to feed the first stage of the
>pipeline with the appropriate grouping of inputs.  I was concerned
>because you post kind of implied that you've already got your 4096
>data coming in as if it were in bit-reversed order.  Bit-reversing
>your 4096 inputs when they're coming in 64 at a time is not trivial
>and requires some careful thought.  A double buffer would work, but
>it's extra hardware.
>
>Not to worry.  I'm sure it can be done, but the devil's in the
>details.
>
>I've also got a couple of other suggestions, but I'll post them later,
>and think about your buffering problem in the meantime.
>
>You've got an interesting problem.
>
>Kevin McGee
>
>


Actually for the ordering of the input -- I have always thought of using
double buffers. Extra hardware -- which here is memory -- is abundant on
the FPGA (38000 Kb). It will increase latency -- which is not an issue.

Thanks a lot.

Onkar

Reply by kevin ●July 6, 20102010-07-06

I looked at some of the FFT cores being offered by various vendors.
Your requirements are quite steep (and I've always thought it's more
fun to do your own).  But you might want to Google "virtex and FFT and
cores" and check them out.  You can get an idea of how others are
handling the I/O.  And most seem to be doing radix 2, with a few radix
4 or mixed radix (and one NxM - arbitrary numbers).

I have more to add after my previous post, but it's late here, so I'll
post again later today.

Kevin McGee

Reply by kevin ●July 7, 20102010-07-07

OK. I'm convinced that there are at least several ways of
manipulating your inputs such that you can sequence them properly and
process them in the first stage 64 point pipelines. The only question
is which way is most efficient.

As for the structure of the first stage 64, you were wondering if the
higher radix versions would result in longer wiring routes. Well,
you're going to have very long wiring routes even for a radix 2
pipeline. Consider that your input data (as you described) consists
of 64 eight bit words arriving simultaneously (512 bits total). Each
stage of a 64 point radix 2 pipeline contains 32 butterflies.
Presumably, they're done in parallel to meet the required throughput
speed. And since you're doing things with complex inputs, you
actually have 512 + 512 = 1024 bits being processed per clock tick
(or, 64 complex 8 bit words per tick). That's a lot of wires. And
because of the nature of the FFT, some of those wires are going to be
crossing other wires to get to the proper butterfly (it depends on the
exact algorithm, but every single algorithm is going to require it).

As an aside, you might consider taking a look at the correspondence I
mentioned some time before:

K. J. McGee, &#4294967295;Comments on &#4294967295;A 64-Point Fourier Transform Chip for Video
Motion Compensation Using Phase Correlation,&#4294967295; IEEE J. Solid-State
Circuits, vol. 33, no. 6, June 1998, pp. 928-932.

Note in the abstract where I state: lower radix algorithms can be
modified to be computationally equivalent to higher radix algorithms.
Then I describe why in section III of the above.

It's a true statement, and I figured it out in 1981 after manipulating
a lot of FFT graphs. For instance, any 64 point radix 2 graph can be
modified such that it contains the EXACT SAME number of adds,
subtracts and multiplies as in a 64 point radix 8 graph. So lower
radix efficiency can be made = to higher radix efficiency.

So, for me at least, the choice of radix has more to do with what
hardware resources are available, and which arrangement is most
efficient.

So you could proceed by picking a particular 64 point graph. Start
with radix 2, if you want, and see just how much in the way of
resources you'll need to implement it in full parallel form.

There's yet other important thing to note. Since you seem to be
fairly flexible as to your inputs, you might seriously consider doing
things in a serial fashion. Suppose your data were coming in as 512
bits at a time. Then, after 8 ticks of your 375 Mhz clock, you'd have
512 eight bit words of data, and you'd process them serially.

Whether your data is coming in parallel or serial, you should be able
to get the same high throughput that you require. Serial processing
may help you decrease the amount of long wire routes. It's something
that you should consider.

Kevin McGee

Reply by onkars ●July 9, 20102010-07-09

Thanks again for your time.

>There's yet other important thing to note.  Since you seem to be
>fairly flexible as to your inputs, you might seriously consider doing
>things in a serial fashion.  Suppose your data were coming in as 512
>bits at a time.  Then, after 8 ticks of your 375 Mhz clock, you'd have
>512 eight bit words of data, and you'd process them serially.
>
>Whether your data is coming in parallel or serial, you should be able
>to get the same high throughput that you require.  Serial processing
>may help you decrease the amount of long wire routes.  It's something
>that you should consider.
>


I am not exactly sure what you mean by serial. Does it mean having multiple
smaller pipeline FFT engines -- each doing a 4096 point FFT but fed
serially -- instead of 1 big 64 path pipelined fft. If this is what you
meant -- I actually mentioned this in my previous posts -- actually these
were part of the options I am contemplating. See below (extract from
previous post):
"USE **4** FFTs made of 16-path radix-16 using 3 stages (full row and
column
parallel implementation of radix-16) 
OR
USE **8** FFTs made of 8-path radix-8 (full row and column parallel
implementation
of radix-8) using 4 stages"

Whenever I speak of smaller radix(r), I mean each FFT pipeline stage of the
4096 radix-r fft will have this 1 radix implemented as full parallel MDC
structure (which will internally be built of radix-2 type structures e.g.
radix-4 will internally have 4 radix-2s making a radix-4 have 8 complex
additions and finally 3 complex multiplications). So, this implies that
whenever I say I will use small radix-r it means I will have to use more
number of radix-r FFT units. 

Hence I say:
"Using smaller radices has following disadvantages ---
The number of stages and required number of FFTs (to achieve 64
samples/clock) increases. Reasonably assuming that the number arithmetic
units for any approach is same --- this means that the control overhead
(scheduling logic + commutators using logic not mem. etc.) is increased. I
am not sure how much percentage this is.

Advantage of smaller radices is that the routes will be shorter and
higher frequencies can be achievable. Will the routes when using a single
64-path radix-64 solution be longer than when using 4 different radix-16
FFTs each designed as having 16-paths? -- I think they will.
??

Previous 123 Next

64 x 64 FFT algorithm for 4096 pt. FFT

Sign in

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Discussion Groups

Quick Links

About DSPRelated.com

Social Networks

The Related Media Group