Reply by kevin July 11, 20102010-07-11
On Jul 10, 3:29=A0pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:
> >3rd paragraph, 2nd sentence should be =A0.. 0, +/- j, +/-.707 in the > >6th, etc. > > >Kevin > > I dont intend that. I intend to have 64 complex values at a time. Maybe > using the SERDES blocks in the FPGA IOs to do the parallelization in to 8 > bits of a sample.
OK. So you've got 64 complex values arriving per clock tick at 375 Mhz. That's 2x64x8 =3D 1024 bits. Presuming that your inputs are ordered/shuffled/delayed properly for a 4096 FFT, now you've got to do a 64 point fully parallel array architecture to compute your first stage. If you do it as radix 2, DIT, you'd have twiddles of 0 (=3D1) in the 1st stage, twiddles of 0 and +/-j in the 2nd, twiddles of 0, +/- j, and +/-.707 in the 3rd, 0 in the 4th, 0, +/-j in the 5th, and 0, +/-j, +/-.707 in the 6th. PLUS one column of full complex multiplies between the 3rd and 4th stages (same as a 64 point two stage radix 8 graph). And,since you're doing things fully parallel, some of those multiplies will be 0, +/-j and +/-.707 types, which don't require a full multiplier. You may decide to design it differently using a higher radix, which will affect your architecture, and quite possibly your input/ouput ordering, but that's up to you. For large FFT's, I've often found it to be more difficult to keep track of where I am in the larger graph, and get the input/output order correct for the pipeline stages (or, in your case, the 64 point arrays). Kevin McGee
Reply by onkars July 10, 20102010-07-10
>3rd paragraph, 2nd sentence should be .. 0, +/- j, +/-.707 in the >6th, etc. > >Kevin >
I dont intend that. I intend to have 64 complex values at a time. Maybe using the SERDES blocks in the FPGA IOs to do the parallelization in to 8 bits of a sample.
Reply by kevin July 10, 20102010-07-10
3rd paragraph, 2nd sentence should be  .. 0, +/- j, +/-.707 in the
6th, etc.

Kevin
Reply by kevin July 10, 20102010-07-10
On Jul 9, 12:39 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:

>I am not exactly sure what you mean by serial.
By serial, I mean that all the arithmetic operations (+,-,x) are done in serially - one bit at a time. This obviously requires more arithmetic units to meet a given throughput, but the units are smaller, and the wiring between units is serial. But ignore that for now because there are other issues I'd like to address, and I hope they will answer the questions raised in your post. Consider a conventional 4096, radix 2, sequential in, bit reverse out FFT. Decimation-in-Time (DIT) (input leg twiddles). Let's say we use a fully parallel array to process it. We'd have 12 stages of butterflies, with 4096/2 (=2048) butterflies per stage. Our inputs will consist of 4096 complex words, which means that we have 2x4096 inputs of so many bits each. The twiddles in the 1st stage are unity (twiddle 0 = 1), the twiddles in the 2nd stage are 0 (=1) and +/- j (sign change and swap). The 3rd stage has 0, +/-j and +/-.707 types. And we could implement the +/-.707 types with special hardware. Later stages have more difficult twiddles, so perhaps we'd use full multipliers to do them. The above is inefficient because, as I mentioned before, you can modify a lower radix graph to be just as computationally efficient as a higher radix graph. In doing that for the above example, we'd have twiddle 0's in the 1st stage, 0 and +/- j in the 2nd, 0,+/- j in the 3rd, 0 in the 4th, 0 and +/- j in the 5th, 0, +/- j in the 6th, etc. We'd ALSO have full multipliers between the 3rd and 4th stage, the 6th and 7th stage, and the 9th and 10th stage. Overall, it's very similar to having 4 stages of a radix 8 graph, or 2 stages off a radix 64 graph (8x8x8x8 = 64x64 = 4096). But our STRUCTURE is still radix 2. And some of the multiplies in the columns of full multipliers will be of the 0, +/- j and +/- .707 types. The net result is a substantial reduction of full multipliers compared to the naive radix 2 approach. Now consider using a higher radix structure, say, radix 8. You'd have 4 stages of radix 8 butterflies, with 4096/8 (=512) butterflies per stage. But remember that each radix 8 butterfly consists of 3 stages of four radix 2 butterflies. So I have a little bit of difficulty with your statement:
>Hence I say: >"Using smaller radices has following disadvantages --- >The number of stages and required number of FFTs (to achieve 64 >samples/clock) increases.
I think you're presuming that a radix 8 computes in the same amount of time as a radix 2. Shouldn't 3 radix 2 stages (an 8 point FFT) compute in the same amount of time as1 radix 8 stage (an 8 point FFT)? I guess it may depend on how tightly packed the elements are (+, -, x). Your other statement is:
>Reasonably assuming that the number arithmetic >units for any approach is same
Yes, with emphasis on ANY APPROACH. With a 4096 graph, the computational efficiency (number of +, -, x) of a radix 2 graph can be made identical to a radix 8 or radix 64 graph. As to whether lower or higher radix implementations have shorter wiring routes - that's a good question. Remember that with higher radix, you're basically grouping a larger number of radix 2 butterflies and drawing/calling it a radix 8 butterfly. So the wiring routes INTERNAL to the radix 8 butterfly may seem to decrease, and the wiring EXTERNAL to the butterfly may seem to increase. But maybe it only LOOKS that way because of the way in which FFT graphs are drawn. The graphs are reality in terms of processing sequence, but not necessarily in terms of physical layout. In full custom design, a critical factor is the height/width of the various elements you're dealing with. In one case I dealt with, the pads were huge compared to the size of the FFT I was doing, so it didn't much matter how tall or wide my adds/subtracts or multiplies were. Very long (slow) input/output wires converged to my tiny processor. I had spent all my time trying to figure out how to tightly space my processing elements (let's see, if I use a barrel shifter instead of multiplexers, I can lay out my commutator design with Nlog2(N) transistors, versus a lot more for an 8x8 multiplexer, and the height fits the adders much better). Talk about the forest and the trees. You have a different problem because you're using an FPGA (I've used them too, but I find that I have to think differently to partition things properly).
>Will the routes when using a single >64-path radix-64 solution be longer than when using 4 different radix-16 >FFTs each designed as having 16-paths? -- I think they will.
On an FPGA, I guess I'd agree with that. But I'm not certain. One other thing troubles me. You're presuming 64 eight bit inputs at a time, and a 64 point array to do the two stages you need (input shuffling/reordering, 1st 64 point array FFT, second stage shuffling/ reordering, 2nd 64 point array, possibly followed by output reordering, depending how the 2 shuffling/compute stages are designed). It just seems to me that, for complex valued inputs and a 64 point array, you'd need 128 inputs at a time. 64 real and 64 imaginary. With 8 bits each, that's 128x8 = 1024 bits. You're presuming 64 eight point values per 375 Mhz clock tick (512 bits). Doesn't that mean that you're going to delay 512 bits by 1 clock cycle, and wait until you have 1024 bits to pump them into your array? That would cut your processing clock rate by half. Is this what you intend? I'd have more to say, but it's really late here. Kevin McGee
Reply by onkars July 9, 20102010-07-09
Thanks again for your time.

>There's yet other important thing to note. Since you seem to be >fairly flexible as to your inputs, you might seriously consider doing >things in a serial fashion. Suppose your data were coming in as 512 >bits at a time. Then, after 8 ticks of your 375 Mhz clock, you'd have >512 eight bit words of data, and you'd process them serially. > >Whether your data is coming in parallel or serial, you should be able >to get the same high throughput that you require. Serial processing >may help you decrease the amount of long wire routes. It's something >that you should consider. >
I am not exactly sure what you mean by serial. Does it mean having multiple smaller pipeline FFT engines -- each doing a 4096 point FFT but fed serially -- instead of 1 big 64 path pipelined fft. If this is what you meant -- I actually mentioned this in my previous posts -- actually these were part of the options I am contemplating. See below (extract from previous post): "USE **4** FFTs made of 16-path radix-16 using 3 stages (full row and column parallel implementation of radix-16) OR USE **8** FFTs made of 8-path radix-8 (full row and column parallel implementation of radix-8) using 4 stages" Whenever I speak of smaller radix(r), I mean each FFT pipeline stage of the 4096 radix-r fft will have this 1 radix implemented as full parallel MDC structure (which will internally be built of radix-2 type structures e.g. radix-4 will internally have 4 radix-2s making a radix-4 have 8 complex additions and finally 3 complex multiplications). So, this implies that whenever I say I will use small radix-r it means I will have to use more number of radix-r FFT units. Hence I say: "Using smaller radices has following disadvantages --- The number of stages and required number of FFTs (to achieve 64 samples/clock) increases. Reasonably assuming that the number arithmetic units for any approach is same --- this means that the control overhead (scheduling logic + commutators using logic not mem. etc.) is increased. I am not sure how much percentage this is. Advantage of smaller radices is that the routes will be shorter and higher frequencies can be achievable. Will the routes when using a single 64-path radix-64 solution be longer than when using 4 different radix-16 FFTs each designed as having 16-paths? -- I think they will. ??
Reply by kevin July 7, 20102010-07-07
OK.  I'm convinced that there are at least several ways of
manipulating your inputs such that you can sequence them properly and
process them in the first stage 64 point pipelines.  The only question
is which way is most efficient.

As for the structure of the first stage 64, you were wondering if the
higher radix versions would result in longer wiring routes.  Well,
you're going to have very long wiring routes even for a radix 2
pipeline.  Consider that your input data (as you described) consists
of 64 eight bit words arriving simultaneously (512 bits total).  Each
stage of a 64 point radix 2 pipeline contains 32 butterflies.
Presumably, they're done in parallel to meet the required throughput
speed.  And since you're doing things with complex inputs, you
actually have 512 + 512 = 1024 bits being processed per clock tick
(or, 64 complex 8 bit words per tick).  That's a lot of wires.  And
because of the nature of the FFT, some of those wires are going to be
crossing other wires to get to the proper butterfly (it depends on the
exact algorithm, but every single algorithm is going to require it).

As an aside, you might consider taking a look at the correspondence I
mentioned some time before:

K. J. McGee, &#4294967295;Comments on &#4294967295;A 64-Point Fourier Transform Chip for Video
Motion Compensation Using Phase Correlation,&#4294967295; IEEE J. Solid-State
Circuits, vol. 33, no. 6, June 1998, pp. 928-932.

Note in the abstract where I state: lower radix algorithms can be
modified to be computationally equivalent to higher radix algorithms.
Then I describe why in section III of the above.

It's a true statement, and I figured it out in 1981 after manipulating
a lot of FFT graphs.  For instance, any 64 point radix 2 graph can be
modified such that it contains the EXACT SAME number of adds,
subtracts and multiplies as in a 64 point radix 8 graph.  So lower
radix efficiency can be made = to higher radix efficiency.

So, for me at least, the choice of radix has more to do with what
hardware resources are available, and which arrangement is most
efficient.

So you could proceed by picking a particular 64 point graph.  Start
with radix 2, if you want, and see just how much in the way of
resources you'll need to implement it in full parallel form.

There's yet other important thing to note.  Since you seem to be
fairly flexible as to your inputs, you might seriously consider doing
things in a serial fashion.  Suppose your data were coming in as 512
bits at a time.  Then, after 8 ticks of your 375 Mhz clock, you'd have
512 eight bit words of data, and you'd process them serially.

Whether your data is coming in parallel or serial, you should be able
to get the same high throughput that you require.  Serial processing
may help you decrease the amount of long wire routes.  It's something
that you should consider.

Kevin McGee
Reply by kevin July 6, 20102010-07-06
I looked at some of the FFT cores being offered by various vendors.
Your requirements are quite steep (and I've always thought it's more
fun to do your own).  But you might want to Google "virtex and FFT and
cores" and check them out.  You can get an idea of how others are
handling the I/O.  And most seem to be doing radix 2, with a few radix
4 or mixed radix (and one NxM - arbitrary numbers).

I have more to add after my previous post, but it's late here, so I'll
post again later today.

Kevin McGee
Reply by onkars July 5, 20102010-07-05
> >On Jul 3, 11:09 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com> >wrote: > >>As for the inputs to the FPGA -- thats an open aspect too. It can be >>designed to best suite the FFT implementation. But we will be using the >>HighSpeed serial IOs. So it pretty much depends on the how the FFT is >>implemented -- we could even use external fast demuxes if necessary. > >>Also, the input is complex. > >>As mentioned earlier I should have 64 samples output per clock cycle. As
a
>>reminder I am doing 4096 pt. fft. > >Very useful information. > >I'm a little concerned about the rest of your post because I'm >presuming that your input data is sequential. That is, 8 bits each of >points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then >128 ... 191, then 192 ... 255, all the way up to 4095. So you have 64 >eight bit sequential inputs per clock tick at 375 Mhz (512 bits >total). > >Here's the problem. With a 4096 point FFT, you've got 2 stages of 64 >point butterflies with 1 column of inter-stage twiddles between them. >No matter what the input/output order, geometry type or decimation, >the top most butterfly in the first stage uses the following inputs: >0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is >4096-64 = 4032). The next butterfly down from the top in the first >stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033). >The next butterfly down uses a PLUS 2 (2, 66, 130, 194, ... 4034). >Once again, samples separated by 64. > >If you need any clarification, I can send a 64 point 8x8 bitmap >(or .pdf or whatever) via 'reply to author'. It's not at all >difficult to explain what a 64x64 graph looks like after viewing the >8x8 (and I can send the explanation, too). > >Now you can break down the a 4096 many different ways, but you'll >still have the same problem of how to feed the first stage of the >pipeline with the appropriate grouping of inputs. I was concerned >because you post kind of implied that you've already got your 4096 >data coming in as if it were in bit-reversed order. Bit-reversing >your 4096 inputs when they're coming in 64 at a time is not trivial >and requires some careful thought. A double buffer would work, but >it's extra hardware. > >Not to worry. I'm sure it can be done, but the devil's in the >details. > >I've also got a couple of other suggestions, but I'll post them later, >and think about your buffering problem in the meantime. > >You've got an interesting problem. > >Kevin McGee > >
Actually for the ordering of the input -- I have always thought of using double buffers. Extra hardware -- which here is memory -- is abundant on the FPGA (38000 Kb). It will increase latency -- which is not an issue. Thanks a lot. Onkar
Reply by kevin July 4, 20102010-07-04
On Jul 3, 11:09 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com>
wrote:

>As for the inputs to the FPGA -- thats an open aspect too. It can be >designed to best suite the FFT implementation. But we will be using the >HighSpeed serial IOs. So it pretty much depends on the how the FFT is >implemented -- we could even use external fast demuxes if necessary.
>Also, the input is complex.
>As mentioned earlier I should have 64 samples output per clock cycle. As a >reminder I am doing 4096 pt. fft.
Very useful information. I'm a little concerned about the rest of your post because I'm presuming that your input data is sequential. That is, 8 bits each of points 0, 1, 2, 3 ... 63, followed by points 64, 65, ... 127, then 128 ... 191, then 192 ... 255, all the way up to 4095. So you have 64 eight bit sequential inputs per clock tick at 375 Mhz (512 bits total). Here's the problem. With a 4096 point FFT, you've got 2 stages of 64 point butterflies with 1 column of inter-stage twiddles between them. No matter what the input/output order, geometry type or decimation, the top most butterfly in the first stage uses the following inputs: 0, 64, 128, 192 ... 4032 (every 64th input - the last one listed is 4096-64 = 4032). The next butterfly down from the top in the first stage uses the same sequence above PLUS 1 (1, 65, 129, 193 ... 4033). The next butterfly down uses a PLUS 2 (2, 66, 130, 194, ... 4034). Once again, samples separated by 64. If you need any clarification, I can send a 64 point 8x8 bitmap (or .pdf or whatever) via 'reply to author'. It's not at all difficult to explain what a 64x64 graph looks like after viewing the 8x8 (and I can send the explanation, too). Now you can break down the a 4096 many different ways, but you'll still have the same problem of how to feed the first stage of the pipeline with the appropriate grouping of inputs. I was concerned because you post kind of implied that you've already got your 4096 data coming in as if it were in bit-reversed order. Bit-reversing your 4096 inputs when they're coming in 64 at a time is not trivial and requires some careful thought. A double buffer would work, but it's extra hardware. Not to worry. I'm sure it can be done, but the devil's in the details. I've also got a couple of other suggestions, but I'll post them later, and think about your buffering problem in the meantime. You've got an interesting problem. Kevin McGee
Reply by onkars July 4, 20102010-07-04
>On Jul 3, 1:38=A0am, kevin <kevinjmc...@netscape.net> wrote: >> On Jul 1, 5:54 pm, "onkars" <onkar.sarode@n_o_s_p_a_m.gmail.com> >> wrote: >> >> >I figured out that I could realistically achieve 350 MHz on the FPGA
--
>> >which implies that to achieve 20 Gsamples/s I need approx. 64 samples >> >output per clock cycle @350 MHz. >> >> OK. =A0I get it now why you are probably going to have to use an array >> type arrangement to do the 64 point. >> >> But I'm not clear exactly how your data is coming into the VIRTEX >> slices - are you using the Ghz serial port? =A0And how is the input
data
>> partitioned and/or sequenced? =A0Sequential in, I suppose. >> >> You mentioned the need for 100% utilization for the input buffering. >> Quite possibly you may find the following to be helpful. =A0First, >> things like bit reversal or matrix transpose are fairly easy out of a >> RAM. =A0For instance, in several bit reverse in/sequential out >> architectures, the data is first bit reversed by reading/writing out >> of RAM. =A0Consider the simple case of an 8 point FFT. =A0Data points 0
t=
>o >> 7 are loaded into sequential RAM locations, with the addressing >> controlled by a counter. =A0Now the counter has a 2 to 1 mux on every >> bit output. =A0Things are arranged such that the counter output is >> either in sequential mode (i.e.: count from 0 to 7), or in bit >> reversed mode (count in bit reversed order). =A0When the second set of
8
>> data points arrives, you do a read/write of the location. =A0The data >> being read out will be the bit reversed sequence of your first 8 >> points. =A0The second set of data will actually be stored in the bit >> reversed locations. =A0When the third set of 8 points arrives, you
read/
>> write again with the counter in sequential mode. =A0The second set of >> data will also come out in bit reversed order. =A0All you need do is >> start out correctly and switch the counter mode every 8 data samples. >> It actually works - I've done things that way for a very long time. >> Try it using the 8 point example, it's really not at all difficult to >> understand. >> >> Here's another thing about the above point: =A0take a look at the shift >> cell / commutator arrangement that you see in a great many pipeline >> architectures. =A0They're all bit reversers or matrix transposers. If >> you have a fast enough RAM that can tolerate the input speeds, you >> might consider doing things that way. =A0All you have to do is set up >> the counter properly. =A0Advantage: no more commutators. >> >> And just another thought: presuming your input data is real only, have >> you considered using the '2N real FFT with an N complex FFT' >> technique. =A0That would cut your FFT size down to 2048. =A0There's a >> little bit of extra processing involved, but it's roughly equivalent >> to adding a column of butterflies at the end. >> >> I really haven't had enough time to look at your post thoroughly. >> I'll spend some time over the weekend on it. =A0I also spent a little >> time scanning the VIRTEX specs. >> >> I could suggest other things or different architectures (or make >> better comments on your suggestions), but I'd need to have a clearer >> idea of just exactly how that input data is coming in. >> >> Kevin McGee > >Meant to say: the counter outputs go to a 2 to 1 mux such that you can >select either sequential output or bit reversed output. >
Thank you. I really appreciate your time. As for the inputs to the FPGA -- thats an open aspect too. It can be designed to best suite the FFT implementation. But we will be using the HighSpeed serial IOs. So it pretty much depends on the how the FFT is implemented -- we could even use external fast demuxes if necessary. Also, the input is complex. As mentioned earlier I should have 64 samples output per clock cycle. As a reminder I am doing 4096 pt. fft. So this can be done using various options e.g. one FFT made of 64-path radix-64 using 2 stages (full row and column parallel implementation of radix-64) OR 4 FFTs made of 16-path radix-16 using 3 stages (full row and column parallel implementation of radix-16) OR 8 FFTs made of 8-path radix-8 (full row and column parallel implementation of radix-8) using 4 stages Using smaller radices has following disadvantages --- The number of stages and required number of FFTs (to achieve 64 samples/clock) increases. Reasonably assuming that the number arithmetic units for any approach is same --- this means that the control overhead (scheduling logic + commutators using logic not mem. etc.) is increased. I am not sure how much percentage this is. Advantage of smaller radices is that the routes will be shorter (???) and higher frequencies can be achievable. Will the routes when using a single 64-path radix-64 solution be longer than when using 4 16-path radix-16 ffts ?? Somehow I am inclined towards going for the 16 path radix 16 solution, and aim to achieve 250 to 300 MHz. I have seen a paper (credible) claim they achieved 40 MHz for a fully parallel (row and column) 256 pt. fft using 1024 radix 2 butterflies on Virtex 5.