DSPRelated.com
Forums

64 x 64 FFT algorithm for 4096 pt. FFT

Started by onkars June 29, 2010
>On 6/29/2010 6:22 PM, onkars wrote: >> I am looking for an FFT algorithm that can give me a 4096 pt. FFT using
t=
>wo >> 64 pt. FFTs. Is it the mixed radix FFT? >... plus second post of 7:28 PM > >4096=3D64x64 is not mixed radix. It's 2 stages of radix 64 butterflies, >so it's radix 64. Mixed radix would be 4096 =3D 128x32, or 32x128, or >256x16, or 16x256, or ... (it's a long list). > >Based on some of your previous posts, I presume that you're looking to >do this in hardware. In that case, a 64x64 isn't a bad choice (there >are better ones), because you can break the 64x64 down into a 4 stage >pipeline of 8x8x8x8. Radix 8 is quite efficient in that you only have >a couple of 'difficult' internal twiddles of the +/-.707.... type, and >they can be done efficiently with a special purpose multiplier. So >you'd have 4 radix 8 butterfly stages (each with a special purpose +/-. >707... multiplier), plus 3 full scale multipliers for the inter-stage >twiddles. > >There are a lot of additional concepts that could be applied, but I'm >not really sure exactly what you're trying to do, so I won't get into >them. > >To get an idea of the basic concepts, you may want to look at: > >L. R. Rabiner, B. Gold, Theory and Application of Digital Signal >Processing. Englewood Cliffs, NJ, Prentice-Hall, 1975. > >Chapters 6 and 10 in the above deal with FFT and some basic pipeline >architectures. Although the book is out of print, you may be able to >find it at a local university library, or on EBAY. > >As for the twiddles between the stages, they're fairly simple to >figure out. I could tell you what they are for a 64x64 graph, or an >8x8x8x8, but then I'd be doing all the work, and you wouldn't really >learn anything (here's a hint: note the pattern of the input/output >order and twiddle sequence of each butterfly for the 8x8 graph in Fig. >10.15 on p. 587 in Rabiner and Gold's book. Now imagine what the >graph would look like if you had 2 stages of 64x64 butterflies). > >If you get through all of the above and feel a little ambitious, you >might want to take a look at: > >K. J. McGee, =93Comments on =91A 64-Point Fourier Transform Chip for
Video
>Motion Compensation Using Phase Correlation,=94 IEEE J. Solid-State >Circuits, vol. 33, no. 6, June 1998, pp. 928-932. > >In particular, pay attention to Section III in the above: Equivalence >of Higher and Lower Radix Graphs. It has important implications for >shuffle exchange algorithms and pipeline architectures (and, as noted, >I originally developed the concept in 1981). > >Kevin McGee >
Hi Kevin, Thanks for your detailed response. Actually I am working on building a 4096 pt. FFT processor (8 bit input with scaling) which is aimed at athroughtput of 20+ GigaSamples/s. I am targeting this design for the Virtex 6 FPGA. With some initial resource estimation and experiments with the Xilinx core FFT, I realized that the issue with an FPGA was not so much multipliers and memory -- it is the number of slices that are the bottleneck. For Virtex 6 I can use the DSP48s for complex multiplication (4 per complex mult) -- Virtex 6 has 2000 of these. And it also has abundant Block RAMs. So I have to efficiently use the slices -- saving adders, registers could help this. I figured out that I could realistically achieve 350 MHz on the FPGA -- which implies that to achieve 20 Gsamples/s I need approx. 64 samples output per clock cycle @350 MHz. So my goals sort of become: 1. Minimize slice usage -- even at the cost of extra memory usage or slightly more mults (implies reducing adders and registers). 2. Reduce long routes/wires which could bring the max clock freq. below 350 MHz. 3. one thing is for sure -- use MDC structures -- using SDF and SDC structures saves memory, which is not our goal. And the options that I consider: 1. Use 64x64 structure -- with reordering memory (RAMs) in between the 2 64 stages. Also, here I intend to use the full-blown parallel-pipelined (i.e. full row and full column parallelism) implementation of 64 stage (internally using maybe radix 4 i.e. 2-squared implemenation). The advantages of this arch is that control will be minimal and simple. Also, single structure computes the FFT @350 MHz hence the inputs will always arrive on the same pins -- which feed this FFT. The downside is that the routes will be long -- and may implact the ability to achieve 350 MHz. 2. Another approach is using X number FFTs (unable to figure out X and fft length) in parallel (each implemented as radix 4 MDC). And these X parallel FFTs feeding the 64 inputs of single 64 pt. FFT implemented full blown (full parallelism) -- which eventually gives 4096 pt. FFT. 3. Use 8x8x8x8 MDC structure FFT. Will need to use 8 of these to get 64 outputs per cycle(with input demux feeding the 8 FFTs). The radix 8 stages will be implemented using 2^3 (or 2-cube radix). Good point is shorter routes, but more commutators -- making it complex. Another issue is the input buffering will have to be managed to achieve 100% utilization. Also, there will have to be some kind of demux. between the input pins and 8 different FFTs OR use 8 different input pin sets and external demux. 4. similar to (2) but use 4x4x4x4x4x4 MDC. Use 16 of these. Advantage is even smaller routes. Disadvantages -- more complex commutators and control. Input pins to 16 FFT routing. Do you have any suggestions? Any comments or any other different architecture or some pointers/references I could use? Thank you. Onkar
On Jul 1, 5:54=A0pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:

(snip)

Some general comments first:

There are a great many pipeline FFT architectures (>100, at least).
You should thoroughly review the patent literature, journals and
conference proceedings.

Not all pipelines are created equal.  Back in the 1980's, I did a
comparative analysis of many architectures.  Some use less memory than
others.  Some allow for much higher input/output data speeds (i.e.:
given a certain technology speed, what's the highest clock rate you
can use?).  In some, the arithmetic units operate only 50% or 25% of
the time.  For one of the architectures, I had to view the original
patent, rather than the published articles, because I determined that
it couldn't possibly work unless you had some extra registers at the
input/output of the arithmetic units (sure enough, they're right there
in the patent, but none of the published articles show them).

You might want to take a look at Fig. 3 in the previously mentioned
1998 journal article of mine.  It shows a 3 stage 64 point 4x4x4.
It's 100% efficient in that all arithmetic units operate with no dead
time.  And one nice thing about it is that the outputs are in line
sequential order (see the figure in the paper for the sequence).

You could either use that as your first 64 point, dump the outputs
into a buffer, and then use the another 4x4x4 for the second 64
point.  Or you could design a 16x16x16.  Your arithmetic unit in that
case will have to handle the +/- .707... twiddles, plus the couple of
others that you get for a 16 point FFT.

As for some of your specific comments:

>> Reduce long routes/wires which could bring the max clock freq. below 350=
MHz. A persistent problem. In the late 1980's, I took a course at LSI Logic for designing with their gate array. It was an R+D project, so we didn't have the money in the budget to pay for a mask charge (100K), but I was able to use their design tools to simulate a 64 point pipeline that I had designed. It took a lot of clever ideas to get it to operate at a 3 nanosecond clock (folding shift registers back on themselves so that data went in and out of them at spots near the arithmetic units; fully pipelining both the adders and multipliers so I could feed them at the maximum clock speed; using the pipelining delay through the adders/multiplier such that they effectively doubled as additional storage) . Although the pipeline could handle it, the I/ O pins would only allow for 5 nanosecond speeds. You're using an FPGA, so your problems could actually be worse in that you probably don't have access to the lower level design elements that I had.
>> Use 64x64 structure -- with reordering memory (RAMs) in between the 2 64 >> stages. Also, here I intend to use the full-blown parallel-pipelined (i.=
e.
>> full row and full column parallelism)
The only time that I would consider using an array architecture like that is if there were no other way to meet the speed requirement.
>> but more commutators
In custom NMOS and CMOS, they're not too difficult. Take a look at the aforementioned 1998 paper and check out reference 4 (US patent 4396994): http://www.freepatentsonline.com/4396994.html The commutator operation can be seen and implemented as a barrel shifter (see Fig. 3 in the above paper). However, I'm not sure that you'll have access to the low level design elements that you'd need when using an FPGA. You may be forced into using mux or demux.
>> similar to (2) but use 4x4x4x4x4x4 MDC. Use 16 of these. Advantage is >> even smaller routes. Disadvantages -- more complex commutators and >> control. Input pins to 16 FFT routing.
Alright. I'd have to think about it. I'll look at your post again tonight. Kevin McGee
On Jul 1, 5:54 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:

>I figured out that I could realistically achieve 350 MHz on the FPGA -- >which implies that to achieve 20 Gsamples/s I need approx. 64 samples >output per clock cycle @350 MHz.
OK. I get it now why you are probably going to have to use an array type arrangement to do the 64 point. But I'm not clear exactly how your data is coming into the VIRTEX slices - are you using the Ghz serial port? And how is the input data partitioned and/or sequenced? Sequential in, I suppose. You mentioned the need for 100% utilization for the input buffering. Quite possibly you may find the following to be helpful. First, things like bit reversal or matrix transpose are fairly easy out of a RAM. For instance, in several bit reverse in/sequential out architectures, the data is first bit reversed by reading/writing out of RAM. Consider the simple case of an 8 point FFT. Data points 0 to 7 are loaded into sequential RAM locations, with the addressing controlled by a counter. Now the counter has a 2 to 1 mux on every bit output. Things are arranged such that the counter output is either in sequential mode (i.e.: count from 0 to 7), or in bit reversed mode (count in bit reversed order). When the second set of 8 data points arrives, you do a read/write of the location. The data being read out will be the bit reversed sequence of your first 8 points. The second set of data will actually be stored in the bit reversed locations. When the third set of 8 points arrives, you read/ write again with the counter in sequential mode. The second set of data will also come out in bit reversed order. All you need do is start out correctly and switch the counter mode every 8 data samples. It actually works - I've done things that way for a very long time. Try it using the 8 point example, it's really not at all difficult to understand. Here's another thing about the above point: take a look at the shift cell / commutator arrangement that you see in a great many pipeline architectures. They're all bit reversers or matrix transposers. If you have a fast enough RAM that can tolerate the input speeds, you might consider doing things that way. All you have to do is set up the counter properly. Advantage: no more commutators. And just another thought: presuming your input data is real only, have you considered using the '2N real FFT with an N complex FFT' technique. That would cut your FFT size down to 2048. There's a little bit of extra processing involved, but it's roughly equivalent to adding a column of butterflies at the end. I really haven't had enough time to look at your post thoroughly. I'll spend some time over the weekend on it. I also spent a little time scanning the VIRTEX specs. I could suggest other things or different architectures (or make better comments on your suggestions), but I'd need to have a clearer idea of just exactly how that input data is coming in. Kevin McGee
On Jul 3, 1:38=A0am, kevin <kevinjmc...@netscape.net> wrote:
> On Jul 1, 5:54 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com> > wrote: > > >I figured out that I could realistically achieve 350 MHz on the FPGA -- > >which implies that to achieve 20 Gsamples/s I need approx. 64 samples > >output per clock cycle @350 MHz. > > OK. =A0I get it now why you are probably going to have to use an array > type arrangement to do the 64 point. > > But I'm not clear exactly how your data is coming into the VIRTEX > slices - are you using the Ghz serial port? =A0And how is the input data > partitioned and/or sequenced? =A0Sequential in, I suppose. > > You mentioned the need for 100% utilization for the input buffering. > Quite possibly you may find the following to be helpful. =A0First, > things like bit reversal or matrix transpose are fairly easy out of a > RAM. =A0For instance, in several bit reverse in/sequential out > architectures, the data is first bit reversed by reading/writing out > of RAM. =A0Consider the simple case of an 8 point FFT. =A0Data points 0 t=
o
> 7 are loaded into sequential RAM locations, with the addressing > controlled by a counter. =A0Now the counter has a 2 to 1 mux on every > bit output. =A0Things are arranged such that the counter output is > either in sequential mode (i.e.: count from 0 to 7), or in bit > reversed mode (count in bit reversed order). =A0When the second set of 8 > data points arrives, you do a read/write of the location. =A0The data > being read out will be the bit reversed sequence of your first 8 > points. =A0The second set of data will actually be stored in the bit > reversed locations. =A0When the third set of 8 points arrives, you read/ > write again with the counter in sequential mode. =A0The second set of > data will also come out in bit reversed order. =A0All you need do is > start out correctly and switch the counter mode every 8 data samples. > It actually works - I've done things that way for a very long time. > Try it using the 8 point example, it's really not at all difficult to > understand. > > Here's another thing about the above point: =A0take a look at the shift > cell / commutator arrangement that you see in a great many pipeline > architectures. =A0They're all bit reversers or matrix transposers. If > you have a fast enough RAM that can tolerate the input speeds, you > might consider doing things that way. =A0All you have to do is set up > the counter properly. =A0Advantage: no more commutators. > > And just another thought: presuming your input data is real only, have > you considered using the '2N real FFT with an N complex FFT' > technique. =A0That would cut your FFT size down to 2048. =A0There's a > little bit of extra processing involved, but it's roughly equivalent > to adding a column of butterflies at the end. > > I really haven't had enough time to look at your post thoroughly. > I'll spend some time over the weekend on it. =A0I also spent a little > time scanning the VIRTEX specs. > > I could suggest other things or different architectures (or make > better comments on your suggestions), but I'd need to have a clearer > idea of just exactly how that input data is coming in. > > Kevin McGee
Meant to say: the counter outputs go to a 2 to 1 mux such that you can select either sequential output or bit reversed output.
>On Jul 3, 1:38=A0am, kevin <kevinjmc...@netscape.net> wrote: >> On Jul 1, 5:54 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com> >> wrote: >> >> >I figured out that I could realistically achieve 350 MHz on the FPGA
--
>> >which implies that to achieve 20 Gsamples/s I need approx. 64 samples >> >output per clock cycle @350 MHz. >> >> OK. =A0I get it now why you are probably going to have to use an array >> type arrangement to do the 64 point. >> >> But I'm not clear exactly how your data is coming into the VIRTEX >> slices - are you using the Ghz serial port? =A0And how is the input
data
>> partitioned and/or sequenced? =A0Sequential in, I suppose. >> >> You mentioned the need for 100% utilization for the input buffering. >> Quite possibly you may find the following to be helpful. =A0First, >> things like bit reversal or matrix transpose are fairly easy out of a >> RAM. =A0For instance, in several bit reverse in/sequential out >> architectures, the data is first bit reversed by reading/writing out >> of RAM. =A0Consider the simple case of an 8 point FFT. =A0Data points 0
t=
>o >> 7 are loaded into sequential RAM locations, with the addressing >> controlled by a counter. =A0Now the counter has a 2 to 1 mux on every >> bit output. =A0Things are arranged such that the counter output is >> either in sequential mode (i.e.: count from 0 to 7), or in bit >> reversed mode (count in bit reversed order). =A0When the second set of
8
>> data points arrives, you do a read/write of the location. =A0The data >> being read out will be the bit reversed sequence of your first 8 >> points. =A0The second set of data will actually be stored in the bit >> reversed locations. =A0When the third set of 8 points arrives, you
read/
>> write again with the counter in sequential mode. =A0The second set of >> data will also come out in bit reversed order. =A0All you need do is >> start out correctly and switch the counter mode every 8 data samples. >> It actually works - I've done things that way for a very long time. >> Try it using the 8 point example, it's really not at all difficult to >> understand. >> >> Here's another thing about the above point: =A0take a look at the shift >> cell / commutator arrangement that you see in a great many pipeline >> architectures. =A0They're all bit reversers or matrix transposers. If >> you have a fast enough RAM that can tolerate the input speeds, you >> might consider doing things that way. =A0All you have to do is set up >> the counter properly. =A0Advantage: no more commutators. >> >> And just another thought: presuming your input data is real only, have >> you considered using the '2N real FFT with an N complex FFT' >> technique. =A0That would cut your FFT size down to 2048. =A0There's a >> little bit of extra processing involved, but it's roughly equivalent >> to adding a column of butterflies at the end. >> >> I really haven't had enough time to look at your post thoroughly. >> I'll spend some time over the weekend on it. =A0I also spent a little >> time scanning the VIRTEX specs. >> >> I could suggest other things or different architectures (or make >> better comments on your suggestions), but I'd need to have a clearer >> idea of just exactly how that input data is coming in. >> >> Kevin McGee > >Meant to say: the counter outputs go to a 2 to 1 mux such that you can >select either sequential output or bit reversed output. >
Thank you. I really appreciate your time. As for the inputs to the FPGA -- thats an open aspect too. It can be designed to best suite the FFT implementation. But we will be using the HighSpeed serial IOs. So it pretty much depends on the how the FFT is implemented -- we could even use external fast demuxes if necessary. Also, the input is complex. As mentioned earlier I should have 64 samples output per clock cycle. As a reminder I am doing 4096 pt. fft. So this can be done using various options e.g. one FFT made of 64-path radix-64 using 2 stages (full row and column parallel implementation of radix-64) OR 4 FFTs made of 16-path radix-16 using 3 stages (full row and column parallel implementation of radix-16) OR 8 FFTs made of 8-path radix-8 (full row and column parallel implementation of radix-8) using 4 stages Using smaller radices has following disadvantages --- The number of stages and required number of FFTs (to achieve 64 samples/clock) increases. Reasonably assuming that the number arithmetic units for any approach is same --- this means that the control overhead (scheduling logic + commutators using logic not mem. etc.) is increased. I am not sure how much percentage this is. Advantage of smaller radices is that the routes will be shorter (???) and higher frequencies can be achievable. Will the routes when using a single 64-path radix-64 solution be longer than when using 4 16-path radix-16 ffts ?? Somehow I am inclined towards going for the 16 path radix 16 solution, and aim to achieve 250 to 300 MHz. I have seen a paper (credible) claim they achieved 40 MHz for a fully parallel (row and column) 256 pt. fft using 1024 radix 2 butterflies on Virtex 5.
On Jul 3, 11:09 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:

>As for the inputs to the FPGA -- thats an open aspect too. It can be >designed to best suite the FFT implementation. But we will be using the >HighSpeed serial IOs. So it pretty much depends on the how the FFT is >implemented -- we could even use external fast demuxes if necessary.
>Also, the input is complex.
>As mentioned earlier I should have 64 samples output per clock cycle. As a >reminder I am doing 4096 pt. fft.
Very useful information. I'm a little concerned about the rest of your post because I'm presuming that your input data is sequential. That is, 8 bits each of points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then 128 ... 191, then 192 ... 255, all the way up to 4095. So you have 64 eight bit sequential inputs per clock tick at 375 Mhz (512 bits total). Here's the problem. With a 4096 point FFT, you've got 2 stages of 64 point butterflies with 1 column of inter-stage twiddles between them. No matter what the input/output order, geometry type or decimation, the top most butterfly in the first stage uses the following inputs: 0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is 4096-64 = 4032). The next butterfly down from the top in the first stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033). The next butterfly down uses a PLUS 2 (2, 66, 130, 194, ... 4034). Once again, samples separated by 64. If you need any clarification, I can send a 64 point 8x8 bitmap (or .pdf or whatever) via 'reply to author'. It's not at all difficult to explain what a 64x64 graph looks like after viewing the 8x8 (and I can send the explanation, too). Now you can break down the a 4096 many different ways, but you'll still have the same problem of how to feed the first stage of the pipeline with the appropriate grouping of inputs. I was concerned because you post kind of implied that you've already got your 4096 data coming in as if it were in bit-reversed order. Bit-reversing your 4096 inputs when they're coming in 64 at a time is not trivial and requires some careful thought. A double buffer would work, but it's extra hardware. Not to worry. I'm sure it can be done, but the devil's in the details. I've also got a couple of other suggestions, but I'll post them later, and think about your buffering problem in the meantime. You've got an interesting problem. Kevin McGee
> >On Jul 3, 11:09 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com> >wrote: > >>As for the inputs to the FPGA -- thats an open aspect too. It can be >>designed to best suite the FFT implementation. But we will be using the >>HighSpeed serial IOs. So it pretty much depends on the how the FFT is >>implemented -- we could even use external fast demuxes if necessary. > >>Also, the input is complex. > >>As mentioned earlier I should have 64 samples output per clock cycle. As
a
>>reminder I am doing 4096 pt. fft. > >Very useful information. > >I'm a little concerned about the rest of your post because I'm >presuming that your input data is sequential. That is, 8 bits each of >points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then >128 ... 191, then 192 ... 255, all the way up to 4095. So you have 64 >eight bit sequential inputs per clock tick at 375 Mhz (512 bits >total). > >Here's the problem. With a 4096 point FFT, you've got 2 stages of 64 >point butterflies with 1 column of inter-stage twiddles between them. >No matter what the input/output order, geometry type or decimation, >the top most butterfly in the first stage uses the following inputs: >0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is >4096-64 = 4032). The next butterfly down from the top in the first >stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033). >The next butterfly down uses a PLUS 2 (2, 66, 130, 194, ... 4034). >Once again, samples separated by 64. > >If you need any clarification, I can send a 64 point 8x8 bitmap >(or .pdf or whatever) via 'reply to author'. It's not at all >difficult to explain what a 64x64 graph looks like after viewing the >8x8 (and I can send the explanation, too). > >Now you can break down the a 4096 many different ways, but you'll >still have the same problem of how to feed the first stage of the >pipeline with the appropriate grouping of inputs. I was concerned >because you post kind of implied that you've already got your 4096 >data coming in as if it were in bit-reversed order. Bit-reversing >your 4096 inputs when they're coming in 64 at a time is not trivial >and requires some careful thought. A double buffer would work, but >it's extra hardware. > >Not to worry. I'm sure it can be done, but the devil's in the >details. > >I've also got a couple of other suggestions, but I'll post them later, >and think about your buffering problem in the meantime. > >You've got an interesting problem. > >Kevin McGee > >
Actually for the ordering of the input -- I have always thought of using double buffers. Extra hardware -- which here is memory -- is abundant on the FPGA (38000 Kb). It will increase latency -- which is not an issue. Thanks a lot. Onkar
I looked at some of the FFT cores being offered by various vendors.
Your requirements are quite steep (and I've always thought it's more
fun to do your own).  But you might want to Google "virtex and FFT and
cores" and check them out.  You can get an idea of how others are
handling the I/O.  And most seem to be doing radix 2, with a few radix
4 or mixed radix (and one NxM - arbitrary numbers).

I have more to add after my previous post, but it's late here, so I'll
post again later today.

Kevin McGee
OK.  I'm convinced that there are at least several ways of
manipulating your inputs such that you can sequence them properly and
process them in the first stage 64 point pipelines.  The only question
is which way is most efficient.

As for the structure of the first stage 64, you were wondering if the
higher radix versions would result in longer wiring routes.  Well,
you're going to have very long wiring routes even for a radix 2
pipeline.  Consider that your input data (as you described) consists
of 64 eight bit words arriving simultaneously (512 bits total).  Each
stage of a 64 point radix 2 pipeline contains 32 butterflies.
Presumably, they're done in parallel to meet the required throughput
speed.  And since you're doing things with complex inputs, you
actually have 512 + 512 = 1024 bits being processed per clock tick
(or, 64 complex 8 bit words per tick).  That's a lot of wires.  And
because of the nature of the FFT, some of those wires are going to be
crossing other wires to get to the proper butterfly (it depends on the
exact algorithm, but every single algorithm is going to require it).

As an aside, you might consider taking a look at the correspondence I
mentioned some time before:

K. J. McGee, &#4294967295;Comments on &#4294967295;A 64-Point Fourier Transform Chip for Video
Motion Compensation Using Phase Correlation,&#4294967295; IEEE J. Solid-State
Circuits, vol. 33, no. 6, June 1998, pp. 928-932.

Note in the abstract where I state: lower radix algorithms can be
modified to be computationally equivalent to higher radix algorithms.
Then I describe why in section III of the above.

It's a true statement, and I figured it out in 1981 after manipulating
a lot of FFT graphs.  For instance, any 64 point radix 2 graph can be
modified such that it contains the EXACT SAME number of adds,
subtracts and multiplies as in a 64 point radix 8 graph.  So lower
radix efficiency can be made = to higher radix efficiency.

So, for me at least, the choice of radix has more to do with what
hardware resources are available, and which arrangement is most
efficient.

So you could proceed by picking a particular 64 point graph.  Start
with radix 2, if you want, and see just how much in the way of
resources you'll need to implement it in full parallel form.

There's yet other important thing to note.  Since you seem to be
fairly flexible as to your inputs, you might seriously consider doing
things in a serial fashion.  Suppose your data were coming in as 512
bits at a time.  Then, after 8 ticks of your 375 Mhz clock, you'd have
512 eight bit words of data, and you'd process them serially.

Whether your data is coming in parallel or serial, you should be able
to get the same high throughput that you require.  Serial processing
may help you decrease the amount of long wire routes.  It's something
that you should consider.

Kevin McGee
Thanks again for your time.

>There's yet other important thing to note. Since you seem to be >fairly flexible as to your inputs, you might seriously consider doing >things in a serial fashion. Suppose your data were coming in as 512 >bits at a time. Then, after 8 ticks of your 375 Mhz clock, you'd have >512 eight bit words of data, and you'd process them serially. > >Whether your data is coming in parallel or serial, you should be able >to get the same high throughput that you require. Serial processing >may help you decrease the amount of long wire routes. It's something >that you should consider. >
I am not exactly sure what you mean by serial. Does it mean having multiple smaller pipeline FFT engines -- each doing a 4096 point FFT but fed serially -- instead of 1 big 64 path pipelined fft. If this is what you meant -- I actually mentioned this in my previous posts -- actually these were part of the options I am contemplating. See below (extract from previous post): "USE **4** FFTs made of 16-path radix-16 using 3 stages (full row and column parallel implementation of radix-16) OR USE **8** FFTs made of 8-path radix-8 (full row and column parallel implementation of radix-8) using 4 stages" Whenever I speak of smaller radix(r), I mean each FFT pipeline stage of the 4096 radix-r fft will have this 1 radix implemented as full parallel MDC structure (which will internally be built of radix-2 type structures e.g. radix-4 will internally have 4 radix-2s making a radix-4 have 8 complex additions and finally 3 complex multiplications). So, this implies that whenever I say I will use small radix-r it means I will have to use more number of radix-r FFT units. Hence I say: "Using smaller radices has following disadvantages --- The number of stages and required number of FFTs (to achieve 64 samples/clock) increases. Reasonably assuming that the number arithmetic units for any approach is same --- this means that the control overhead (scheduling logic + commutators using logic not mem. etc.) is increased. I am not sure how much percentage this is. Advantage of smaller radices is that the routes will be shorter and higher frequencies can be achievable. Will the routes when using a single 64-path radix-64 solution be longer than when using 4 different radix-16 FFTs each designed as having 16-paths? -- I think they will. ??