DSPRelated.com
Forums

64 x 64 FFT algorithm for 4096 pt. FFT

Started by onkars June 29, 2010
On Jul 9, 12:39 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:

>I am not exactly sure what you mean by serial.
By serial, I mean that all the arithmetic operations (+,-,x) are done in serially - one bit at a time. This obviously requires more arithmetic units to meet a given throughput, but the units are smaller, and the wiring between units is serial. But ignore that for now because there are other issues I'd like to address, and I hope they will answer the questions raised in your post. Consider a conventional 4096, radix 2, sequential in, bit reverse out FFT. Decimation-in-Time (DIT) (input leg twiddles). Let's say we use a fully parallel array to process it. We'd have 12 stages of butterflies, with 4096/2 (=2048) butterflies per stage. Our inputs will consist of 4096 complex words, which means that we have 2x4096 inputs of so many bits each. The twiddles in the 1st stage are unity (twiddle 0 = 1), the twiddles in the 2nd stage are 0 (=1) and +/- j (sign change and swap). The 3rd stage has 0, +/-j and +/-.707 types. And we could implement the +/-.707 types with special hardware. Later stages have more difficult twiddles, so perhaps we'd use full multipliers to do them. The above is inefficient because, as I mentioned before, you can modify a lower radix graph to be just as computationally efficient as a higher radix graph. In doing that for the above example, we'd have twiddle 0's in the 1st stage, 0 and +/- j in the 2nd, 0,+/- j in the 3rd, 0 in the 4th, 0 and +/- j in the 5th, 0, +/- j in the 6th, etc. We'd ALSO have full multipliers between the 3rd and 4th stage, the 6th and 7th stage, and the 9th and 10th stage. Overall, it's very similar to having 4 stages of a radix 8 graph, or 2 stages off a radix 64 graph (8x8x8x8 = 64x64 = 4096). But our STRUCTURE is still radix 2. And some of the multiplies in the columns of full multipliers will be of the 0, +/- j and +/- .707 types. The net result is a substantial reduction of full multipliers compared to the naive radix 2 approach. Now consider using a higher radix structure, say, radix 8. You'd have 4 stages of radix 8 butterflies, with 4096/8 (=512) butterflies per stage. But remember that each radix 8 butterfly consists of 3 stages of four radix 2 butterflies. So I have a little bit of difficulty with your statement:
>Hence I say: >"Using smaller radices has following disadvantages --- >The number of stages and required number of FFTs (to achieve 64 >samples/clock) increases.
I think you're presuming that a radix 8 computes in the same amount of time as a radix 2. Shouldn't 3 radix 2 stages (an 8 point FFT) compute in the same amount of time as1 radix 8 stage (an 8 point FFT)? I guess it may depend on how tightly packed the elements are (+, -, x). Your other statement is:
>Reasonably assuming that the number arithmetic >units for any approach is same
Yes, with emphasis on ANY APPROACH. With a 4096 graph, the computational efficiency (number of +, -, x) of a radix 2 graph can be made identical to a radix 8 or radix 64 graph. As to whether lower or higher radix implementations have shorter wiring routes - that's a good question. Remember that with higher radix, you're basically grouping a larger number of radix 2 butterflies and drawing/calling it a radix 8 butterfly. So the wiring routes INTERNAL to the radix 8 butterfly may seem to decrease, and the wiring EXTERNAL to the butterfly may seem to increase. But maybe it only LOOKS that way because of the way in which FFT graphs are drawn. The graphs are reality in terms of processing sequence, but not necessarily in terms of physical layout. In full custom design, a critical factor is the height/width of the various elements you're dealing with. In one case I dealt with, the pads were huge compared to the size of the FFT I was doing, so it didn't much matter how tall or wide my adds/subtracts or multiplies were. Very long (slow) input/output wires converged to my tiny processor. I had spent all my time trying to figure out how to tightly space my processing elements (let's see, if I use a barrel shifter instead of multiplexers, I can lay out my commutator design with Nlog2(N) transistors, versus a lot more for an 8x8 multiplexer, and the height fits the adders much better). Talk about the forest and the trees. You have a different problem because you're using an FPGA (I've used them too, but I find that I have to think differently to partition things properly).
>Will the routes when using a single >64-path radix-64 solution be longer than when using 4 different radix-16 >FFTs each designed as having 16-paths? -- I think they will.
On an FPGA, I guess I'd agree with that. But I'm not certain. One other thing troubles me. You're presuming 64 eight bit inputs at a time, and a 64 point array to do the two stages you need (input shuffling/reordering, 1st 64 point array FFT, second stage shuffling/ reordering, 2nd 64 point array, possibly followed by output reordering, depending how the 2 shuffling/compute stages are designed). It just seems to me that, for complex valued inputs and a 64 point array, you'd need 128 inputs at a time. 64 real and 64 imaginary. With 8 bits each, that's 128x8 = 1024 bits. You're presuming 64 eight point values per 375 Mhz clock tick (512 bits). Doesn't that mean that you're going to delay 512 bits by 1 clock cycle, and wait until you have 1024 bits to pump them into your array? That would cut your processing clock rate by half. Is this what you intend? I'd have more to say, but it's really late here. Kevin McGee
3rd paragraph, 2nd sentence should be  .. 0, +/- j, +/-.707 in the
6th, etc.

Kevin
>3rd paragraph, 2nd sentence should be .. 0, +/- j, +/-.707 in the >6th, etc. > >Kevin >
I dont intend that. I intend to have 64 complex values at a time. Maybe using the SERDES blocks in the FPGA IOs to do the parallelization in to 8 bits of a sample.
On Jul 10, 3:29=A0pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:
> >3rd paragraph, 2nd sentence should be =A0.. 0, +/- j, +/-.707 in the > >6th, etc. > > >Kevin > > I dont intend that. I intend to have 64 complex values at a time. Maybe > using the SERDES blocks in the FPGA IOs to do the parallelization in to 8 > bits of a sample.
OK. So you've got 64 complex values arriving per clock tick at 375 Mhz. That's 2x64x8 =3D 1024 bits. Presuming that your inputs are ordered/shuffled/delayed properly for a 4096 FFT, now you've got to do a 64 point fully parallel array architecture to compute your first stage. If you do it as radix 2, DIT, you'd have twiddles of 0 (=3D1) in the 1st stage, twiddles of 0 and +/-j in the 2nd, twiddles of 0, +/- j, and +/-.707 in the 3rd, 0 in the 4th, 0, +/-j in the 5th, and 0, +/-j, +/-.707 in the 6th. PLUS one column of full complex multiplies between the 3rd and 4th stages (same as a 64 point two stage radix 8 graph). And,since you're doing things fully parallel, some of those multiplies will be 0, +/-j and +/-.707 types, which don't require a full multiplier. You may decide to design it differently using a higher radix, which will affect your architecture, and quite possibly your input/ouput ordering, but that's up to you. For large FFT's, I've often found it to be more difficult to keep track of where I am in the larger graph, and get the input/output order correct for the pipeline stages (or, in your case, the 64 point arrays). Kevin McGee