Reply by glen herrmannsfeldt December 13, 2012
robert bristow-johnson <rbj@audioimagination.com> wrote:

(snip)
>>> one idea to consider, if you're already comfortable with fixed-point FFT, is block-floating-point.
>>> your FFT butterfly will need two modes: one where the butterfly is the normal butterfly and the other where all of the results, real and imaginary for both outputs of the butterfly are divided by 2 (by use of an arithmetic shift right).
>> Done right, that should work pretty well in the FPGA.
> it's not that much different than fixed point.
(snip, I wrote)
>> You should be able to factor out any exponent before the data goes in, so all you need to do is count the shifts.
> well that's exactly what i intended to say. but this block could have an exponent going into the FFT. each time you scale the whole block by 1/2 (by a right shift of 1 bit) you increment that common exponent by 1. that's the same as counting the shifts. the exponent could be set to zero before beginning if that's what you want to do. or it could be set to whatever it was meant to be set to by the magnitude of the data inside. that block data should be normalized so that it has headroom of 12 to 18 dB. that is, it should be at least 1/8 fullscale and no more than 1/4 fullscale. two guard bits on the left.
Well, the suggestion was to not waste bits on what is easy to do outside the FPGA, but yes, you could run all the bits through.
>> Though that gets a little more complicated when you do things in parallel like the OP wants.
> what's in parallel? different parts of the pass? if that's the case, it's not any worse. all parallel butterflies should be in the same scaling mode and the sticky bits of each parallel section can be ORed together at the end of a pass.
Well, it seems that the OP wants to combine 32K point blocks together. So, instead of one or two butterflies, you could do it like 15 butterflies at once.
> i think you gotta finish one pass before starting the next.
(snip)
>> With the 18 bit block multipliers, bit widths like 18 and 35 are good choices.
> i wouldn't know what FPGA provide. 35 bits sounds good enough for good twiddle factors. can you easily do a 35x35 signed multiplication with these 18-bit block multipliers?
The low 17 bits of a signed multiplier should work just like an unsigned multiplier. I think you can use that for the low half, and 18 bit signed for the high half (well, not quite half).
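The split glen describes can be checked behaviorally: cut each 35-bit signed operand into an 18-bit signed high part and a 17-bit unsigned low part, form four partial products (each of which fits an 18x18 signed multiplier once the unsigned 17-bit operands are zero-extended into the 18th bit), and recombine. This is a minimal Python sketch of the decomposition, not a claim about any particular DSP block:

```python
def split35(x):
    """Split a 35-bit signed value: x == hi * 2**17 + lo, where hi is an
    18-bit signed part and lo is the low 17 bits taken as unsigned."""
    lo = x & 0x1FFFF          # low 17 bits, treated as unsigned
    hi = (x - lo) >> 17       # arithmetic shift; fits 18-bit signed
    return hi, lo

def mul35(a, b):
    """35x35 signed multiply from four partial products, each within an
    18x18 signed multiplier's range."""
    ah, al = split35(a)
    bh, bl = split35(b)
    return ((ah * bh) << 34) + ((ah * bl + al * bh) << 17) + (al * bl)
```

In hardware the three shifted sums become adders after the block multipliers; the point of the split is that no partial product needs more than 18 signed bits per operand.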
> i dunno squat about the particulars of the FPGA. i figgered, you can have the multiplier be as wide as you want, given enough gates and a good library to define it.
Yes, you can do that, but it would take a lot of gates. Because multipliers are common enough, they put in block multipliers. Much more efficient in gate usage, unless you don't need to do any multiplication. There are also block RAMs that are more efficient than using LUTs for RAM.

-- glen
Reply by robert bristow-johnson December 13, 2012
On 12/12/12 6:12 PM, glen herrmannsfeldt wrote:
> robert bristow-johnson <rbj@audioimagination.com> wrote:
>> On 12/12/12 12:05 PM, gongdori wrote:
>>> 3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.
>> one idea to consider, if you're already comfortable with fixed-point FFT, is block-floating-point.
>> your FFT butterfly will need two modes: one where the butterfly is the normal butterfly and the other where all of the results, real and imaginary for both outputs of the butterfly are divided by 2 (by use of an arithmetic shift right).
> Done right, that should work pretty well in the FPGA.
it's not that much different than fixed point.
> ...
>> for the latter mode, i dunno what you would do about the bit that falls off because there might be problems if you always round down. you might have to round up/down as a random outcome. maybe not.
> I don't know about that one, either.
>> so some entire passes are done in the normal mode and some entire passes are done in the divide-by-2 mode. every pass that the divide-by-2 mode is used, there is a single exponent for the whole block that you increment by 1. when the normal mode pass is used, then the exponent is left alone.
> You should be able to factor out any exponent before the data goes in, so all you need to do is count the shifts.
well that's exactly what i intended to say. but this block could have an exponent going into the FFT. each time you scale the whole block by 1/2 (by a right shift of 1 bit) you increment that common exponent by 1. that's the same as counting the shifts. the exponent could be set to zero before beginning if that's what you want to do. or it could be set to whatever it was meant to be set to by the magnitude of the data inside. that block data should be normalized so that it has headroom of 12 to 18 dB. that is, it should be at least 1/8 fullscale and no more than 1/4 fullscale. two guard bits on the left.
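That normalization (peak held between 1/8 and 1/4 of fullscale, with the common exponent absorbing every shift so the represented value never changes) sketches out in a few lines of Python. The 16-bit word width is an assumption:

```python
def normalize_block(x, width=16):
    """Scale a fixed-point block so its peak lies in [1/8, 1/4) of full
    scale, i.e. two guard bits of headroom. The represented values are
    preserved as x[i] * 2**exponent."""
    fullscale = 1 << (width - 1)
    exponent = 0
    # too hot: arithmetic shift right, bump the common exponent
    while max(abs(v) for v in x) >= fullscale >> 2:
        x = [v >> 1 for v in x]
        exponent += 1
    # too quiet: shift left, drop the exponent (skip an all-zero block)
    while 0 < max(abs(v) for v in x) < fullscale >> 3:
        x = [v << 1 for v in x]
        exponent -= 1
    return x, exponent
```

The same invariant (value = mantissa * 2**exponent) is what the per-pass divide-by-2 maintains inside the FFT itself.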
> Though that gets a little more complicated when you do things in parallel like the OP wants.
what's in parallel? different parts of the pass? if that's the case, it's not any worse. all parallel butterflies should be in the same scaling mode and the sticky bits of each parallel section can be ORed together at the end of a pass. i think you gotta finish one pass before starting the next.
>> to decide which mode you use, assume 12 dB of headroom to start with. the input data is "normalized" so that no sample exceeds 1/4 full scale. it sounds like 2 bits of waste in each fixed-point sample, but you need those for guard bits. then, for each FFT pass, you need a "sticky bit" (that is cleared before the pass begins) that detects if any result has a word that exceeds that 1/4 fullscale limit. if any one word, either real or imag part, exceeds that 1/4 fullscale threshold (in abs value), you set the sticky bit.
> With the 18 bit block multipliers, bit widths like 18 and 35 are good choices.
i wouldn't know what FPGA provide. 35 bits sounds good enough for good twiddle factors. can you easily do a 35x35 signed multiplication with these 18-bit block multipliers? i dunno squat about the particulars of the FPGA. i figgered, you can have the multiplier be as wide as you want, given enough gates and a good library to define it.

-- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
Reply by glen herrmannsfeldt December 12, 2012
robert bristow-johnson <rbj@audioimagination.com> wrote:
> On 12/12/12 12:05 PM, gongdori wrote:
>> 3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.
> one idea to consider, if you're already comfortable with fixed-point FFT, is block-floating-point.
> your FFT butterfly will need two modes: one where the butterfly is the normal butterfly and the other where all of the results, real and imaginary for both outputs of the butterfly are divided by 2 (by use of an arithmetic shift right).
Done right, that should work pretty well in the FPGA. The problem with floating point in an FPGA is that pre and post normalization takes a lot of logic. A little better in the newer families with 6LUTs instead of 4LUTs, but it is still big. Well, that is for add. For multiply and divide, it is much easier, though divide itself, that is, a combinatorial divider, is big. But if you do this one right, it should only have (or not) one shift after each stage.
> for the latter mode, i dunno what you would do about the bit that falls off because there might be problems if you always round down. you might have to round up/down as a random outcome. maybe not.
I don't know about that one, either.
> so some entire passes are done in the normal mode and some entire passes are done in the divide-by-2 mode. every pass that the divide-by-2 mode is used, there is a single exponent for the whole block that you increment by 1. when the normal mode pass is used, then the exponent is left alone.
You should be able to factor out any exponent before the data goes in, so all you need to do is count the shifts. Though that gets a little more complicated when you do things in parallel like the OP wants.
> to decide which mode you use, assume 12 dB of headroom to start with. the input data is "normalized" so that no sample exceeds 1/4 full scale. it sounds like 2 bits of waste in each fixed-point sample, but you need those for guard bits. then, for each FFT pass, you need a "sticky bit" (that is cleared before the pass begins) that detects if any result has a word that exceeds that 1/4 fullscale limit. if any one word, either real or imag part, exceeds that 1/4 fullscale threshold (in abs value), you set the sticky bit.
With the 18 bit block multipliers, bit widths like 18 and 35 are good choices.
> if that sticky bit is set at the beginning of the pass, then all of the butterflies for the upcoming pass are in divide-by-2 mode and the sticky bit is cleared. otherwise the butterflies are in normal mode and the sticky bit is already cleared at the beginning of the pass.
> admittedly this is identical in concept to what the Mot (now Freescale) DSP56K does. which is where i drew it conceptually.
> bit for bit, 32-bit fixed point is better than 32-bit IEEE floating point unless your headroom needs to be very large (40 dB or more). those 8 bits of exponent can be spent more effectively on mantissa.
-- glen
Reply by robert bristow-johnson December 12, 2012
On 12/12/12 12:05 PM, gongdori wrote:
> 3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.
one idea to consider, if you're already comfortable with fixed-point FFT, is block-floating-point.

your FFT butterfly will need two modes: one where the butterfly is the normal butterfly and the other where all of the results, real and imaginary for both outputs of the butterfly, are divided by 2 (by use of an arithmetic shift right). for the latter mode, i dunno what you would do about the bit that falls off because there might be problems if you always round down. you might have to round up/down as a random outcome. maybe not.

so some entire passes are done in the normal mode and some entire passes are done in the divide-by-2 mode. every pass that the divide-by-2 mode is used, there is a single exponent for the whole block that you increment by 1. when the normal mode pass is used, then the exponent is left alone.

to decide which mode you use, assume 12 dB of headroom to start with. the input data is "normalized" so that no sample exceeds 1/4 full scale. it sounds like 2 bits of waste in each fixed-point sample, but you need those for guard bits. then, for each FFT pass, you need a "sticky bit" (that is cleared before the pass begins) that detects if any result has a word that exceeds that 1/4 fullscale limit. if any one word, either real or imag part, exceeds that 1/4 fullscale threshold (in abs value), you set the sticky bit.

if that sticky bit is set at the beginning of the pass, then all of the butterflies for the upcoming pass are in divide-by-2 mode and the sticky bit is cleared. otherwise the butterflies are in normal mode and the sticky bit is already cleared at the beginning of the pass.

admittedly this is identical in concept to what the Mot (now Freescale) DSP56K does. which is where i drew it conceptually.

bit for bit, 32-bit fixed point is better than 32-bit IEEE floating point unless your headroom needs to be very large (40 dB or more). those 8 bits of exponent can be spent more effectively on mantissa.
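The pass-by-pass scheme above can be sketched behaviorally in Python (rather than VHDL). The 16-bit Q15 word width, the truncating shift, and doing the sticky-bit compare on every butterfly output are assumptions of this sketch; a real design would pipeline the butterflies and round the dropped bit as discussed:

```python
import cmath

FULLSCALE = 1 << 15        # assumed 16-bit words; two guard bits
QUARTER = FULLSCALE >> 2   # the 1/4-fullscale sticky-bit threshold

def bfp_fft(re, im):
    """In-place radix-2 DIT FFT with block-floating-point scaling.
    Returns (re, im, exponent); the spectrum is value * 2**exponent."""
    n = len(re)
    j = 0                          # bit-reverse permutation of the input
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]
    exponent = 0
    sticky = any(abs(v) > QUARTER for v in re + im)  # input headroom check
    size = 2
    while size <= n:               # one "pass" per butterfly column
        shift = 1 if sticky else 0     # whole pass in one scaling mode
        exponent += shift
        sticky = False
        for start in range(0, n, size):
            for k in range(size // 2):
                w = cmath.exp(-2j * cmath.pi * k / size)
                i1, i2 = start + k, start + k + size // 2
                tr = int(round(w.real * re[i2] - w.imag * im[i2]))
                ti = int(round(w.real * im[i2] + w.imag * re[i2]))
                outs = [(re[i1] + tr) >> shift, (im[i1] + ti) >> shift,
                        (re[i1] - tr) >> shift, (im[i1] - ti) >> shift]
                re[i1], im[i1], re[i2], im[i2] = outs
                if any(abs(v) > QUARTER for v in outs):
                    sticky = True  # next pass runs in divide-by-2 mode
        size *= 2
    return re, im, exponent
```

Parallel butterflies would each keep a local sticky bit, ORed together at the end of the pass, exactly as discussed above.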
-- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
Reply by glen herrmannsfeldt December 12, 2012
gongdori <16548@dsprelated> wrote:

(snip)
>>> I am interested in performing million-point FFT in FPGA.
(snip)
>> 1M points would require 128Mb of storage, if using 128 bits for each complex point. You didn't specify a SFDR or other metric which would allow the number of bits to be established. I guessed at 128.
(snip)
>>> I am especially interested in a way where I can avoid parallel input.
Well, you can avoid parallel input by separating it out once it gets into the FPGA.

(snip)
> Thank you all for your input. I really appreciate it. Let me answer some of your questions:
> 1) I think about using FPGA to process big big data in shorter time. And, I'm thinking about using Xilinx Virtex-7 part (VC707, if you are familiar with FPGA).
Yes, but one million points isn't so big in NlogN terms, and you don't say how short a time is needed.
> 2) The FFT architecture I am thinking of is Streaming Radix-2^2 architecture which can work on a stream of data. However, since I have much more resource than what a single 32-k streaming FFT requires, I thought about using multiple FFTs for parallel input (also the math was shown in somebody's paper I found online).
Well, the whole idea behind FFT is that you can combine (usually) two FFTs of length N and get one of length 2N. Repeat that the appropriate number of times, and you get the O(NlogN). You should understand how that happens inside the algorithm, even if only for a fairly small N. You might write down a diagram showing how two 4 point FFTs are combined to make 8 points, then consider doing the two in parallel and combining the outputs.

Then you should understand the difference between decimation-in-time and decimation-in-frequency. Either works, but there is a difference in the order you need the data points. (You can even mix them, if that turns out to be better.)
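The combine-two-length-N-FFTs-into-length-2N step is small enough to write down; this is a decimation-in-time sketch in Python (the recursive wrapper is only for illustration; in hardware the two half-size transforms would run on parallel blocks and `combine` would be the final twiddle-and-butterfly column):

```python
import cmath

def combine(even, odd):
    """Merge two length-N FFTs into one length-2N FFT (decimation in
    time): twiddle the odd-half outputs, then one butterfly column."""
    n = len(even)
    out = [0j] * (2 * n)
    for k in range(n):
        t = cmath.exp(-2j * cmath.pi * k / (2 * n)) * odd[k]
        out[k] = even[k] + t
        out[k + n] = even[k] - t
    return out

def fft(x):
    """Recursive wrapper: split into even/odd samples, transform the
    halves, combine. len(x) must be a power of two."""
    if len(x) == 1:
        return list(x)
    return combine(fft(x[0::2]), fft(x[1::2]))
```

Note the data-ordering point made above: decimation in time wants the even- and odd-indexed samples routed to different half-size transforms, which is exactly where the input-ordering choices for parallel blocks come from.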
> 3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.
Yes fixed point is a better choice. You can even do it much wider than the significand of floating point would be, and use less logic. That is, compare 32 bit fixed point with 32 significant bits, to 32 bit floating point with 24 significant bits. Now, it gets more interesting when you have 18x18 block multipliers. (I believe signed, which complicates combining them.) Some bit widths will be much easier than others.
> 4) I have a plan to use PCIe interface to move data in and out...
> I have written streaming, and parallel FFTs in VHDL and thinking about how to mix the two approaches to maximize the throughput and minimize the latency of the design... But, if you have any other bright idea, or a good reference, please let me know!
If you can control the order of the input data, then there are many interesting things you can do, especially as combinations of decimation in time and frequency. If you have N FFT blocks running in parallel, you can fill one block before going to the next one, or send every Nth input to an FFT block. The way you combine the outputs, and also the order that the data comes out, will be different.

How many million (is it really 1000000 or is it 2**20?) point FFTs do you need, and how fast do you need them?

-- glen
Reply by gongdori December 12, 2012
> On Tue, 11 Dec 2012 07:20:07 -0600, gongdori wrote:
>> Hi all,
>> I am interested in performing million-point FFT in FPGA. I've read some papers and concluded that the best way of performing million-point FFT would be having multiple smaller size FFT in parallel, for example 8x 16K-point FFT, with twiddle factor multipliers (I think it will be radix-2^2 streaming FFT with multiple inputs). But, since I am a newbie in DSP, I want to check if there is any easier way. I am especially interested in a way where I can avoid parallel input.
>> Could you give me your idea/comment? Thanks!
>> Thomas
> 1M points would require 128Mb of storage, if using 128 bits for each complex point. You didn't specify a SFDR or other metric which would allow the number of bits to be established. I guessed at 128.
> The largest FPGAs in the newest families from the two leading FPGA vendors have onboard storage for about 50-60Mb. These largest parts cost thousands of dollars each. They will not fit all your points in onboard RAM.
> Your 16k point suggestion would require 2Mb, again assuming 128 bits per complex point. Suitably sized FPGAs for 16k points are much cheaper, but you would need an external memory to hold all 1M points.
> The disadvantage of external memory is that it becomes the limiting factor in the throughput. In that sense, an FPGA won't do any better than a (cheaper and easier-to-program) DSP chip which can use exactly the same type of memory.
> A single 16 bit wide DDR3 RAM chip requires about 40 FPGA pins and can transfer about 10-15Gb/s. (The RAMs can go much faster, but it's hard to get the soft FPGA controllers to work as well as the hard controllers in other devices.)
> OTOH, you can perhaps fit a dozen such RAM interfaces into a (large) FPGA, leading to up to 180Gb/s of data transfer. That's a peak rate though. It can be sustained only with linear addressing of large blocks (for some value of "large"). I'm guessing your method of using multiple 16k FFTs requires the opposite of linear addressing, so this may not work out too well unless you're very careful with how you (re)order your RAM accesses.
>> I am especially interested in a way where I can avoid parallel input.
> Use serial input then. Many modern FPGAs have multiple multi-megabit transceivers. A protocol like 10Gb/s Ethernet could give you a "fill time" of less than 20ms.
> Nice homework problem, BTW.
> Regards, Allan
Thank you all for your input. I really appreciate it. Let me answer some of your questions:

1) I think about using FPGA to process big big data in shorter time. And, I'm thinking about using Xilinx Virtex-7 part (VC707, if you are familiar with FPGA).

2) The FFT architecture I am thinking of is Streaming Radix-2^2 architecture which can work on a stream of data. However, since I have much more resource than what a single 32-k streaming FFT requires, I thought about using multiple FFTs for parallel input (also the math was shown in somebody's paper I found online).

3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.

4) I have a plan to use PCIe interface to move data in and out...

I have written streaming, and parallel FFTs in VHDL and thinking about how to mix the two approaches to maximize the throughput and minimize the latency of the design... But, if you have any other bright idea, or a good reference, please let me know!

Thanks for giving me the link for Dillon and ams. I'll go through it soon! Have a nice day!

Thomas
Reply by Tim Wescott December 12, 2012
On Wed, 12 Dec 2012 04:42:06 +0000, glen herrmannsfeldt wrote:

> Tim Wescott <tim@seemywebsite.please> wrote:
> (snip)
>>> Then Tim Wescott <tim@seemywebsite.com> wrote:
>>>> For certain problems -- and this one 'feels' like a candidate -- doing math on an FPGA can be less expensive in terms of BOM cost, space consumed by hardware, and/or power consumption.
> (snip, then I wrote)
>>> Certainly that is true. But then again, you can always use a soft processor on the FPGA.
>>> The OP didn't say fixed or floating point, which also might matter.
>>> It could also be a school assignment to do something on an FPGA, and possibly with the requirement that it not be in a soft processor.
> (snip)
>> On the whole, I think a soft processor would be slower, bigger, hotter, and more expensive than a real one.
> Yes, but the FPGA has the advantage that you can put much of the other logic that you need inside, also.
>> I was thinking more on the lines that if the task entailed slurping in 10^6 data points, doing an FFT, then spewing out 10^6 results as fast as possible, that an FPGA might (most certainly might, not "would") be the answer.
> The "as fast as possible" was not in the original post. For fixed point, I would be pretty sure that the FPGA would be a good choice. I would be a lot less sure for floating point.
True. I tend to think that if someone wants to use an FPGA for a job, sans processor, that the "as fast as possible" is implied. But that's often wrong.
> Many FPGA families include an 18x18 block multiplier. If 18 bits is enough, then the limit is mostly how fast you can stuff numbers through the multipliers, though a little bit moving them around.
Even if 18x18 isn't enough, the multiplier would help a lot with larger word lengths.
> I haven't gone through the bit-reversed order algorithm recently, but I believe that if you can get them in that order it works pretty well. If not, then you have to store them somewhere, and though FPGAs do have a lot of memory now, maybe not that much.
They don't have that much memory yet, to my knowledge. One would need an FPGA with RAM hanging off of it. Of course, this would still present interesting challenges if the data needs to be reordered and the RAM favors burst transfers (but then, I do recall seeing an algorithm for an FFT to be performed on data stored on tape that keeps the data in order -- this would work well for an FFT on data stored in SDRAM and read out in blocks, fast).
>> Of course, if the assignment is coming from a prof, then that's all she wrote.
> Yes. Maybe the OP will explain more.
I assumed a lot in my responses. It would be an interesting problem if an FPGA was actually demanded for technical reasons, and not just because it's required for pedagogical reasons or because someone doesn't understand what an FPGA is good for.

-- Tim Wescott Control system and signal processing consulting www.wescottdesign.com
Reply by Allan Herriman December 12, 2012
On Tue, 11 Dec 2012 07:20:07 -0600, gongdori wrote:

> Hi all,
>
> I am interested in performing million-point FFT in FPGA. I've read some papers and concluded that the best way of performing million-point FFT would be having multiple smaller size FFT in parallel, for example 8x 16K-point FFT, with twiddle factor multipliers (I think it will be radix-2^2 streaming FFT with multiple inputs). But, since I am a newbie in DSP, I want to check if there is any easier way. I am especially interested in a way where I can avoid parallel input.
>
> Could you give me your idea/comment? Thanks!
>
> Thomas
1M points would require 128Mb of storage, if using 128 bits for each complex point. You didn't specify a SFDR or other metric which would allow the number of bits to be established. I guessed at 128.

The largest FPGAs in the newest families from the two leading FPGA vendors have onboard storage for about 50-60Mb. These largest parts cost thousands of dollars each. They will not fit all your points in onboard RAM.

Your 16k point suggestion would require 2Mb, again assuming 128 bits per complex point. Suitably sized FPGAs for 16k points are much cheaper, but you would need an external memory to hold all 1M points.

The disadvantage of external memory is that it becomes the limiting factor in the throughput. In that sense, an FPGA won't do any better than a (cheaper and easier-to-program) DSP chip which can use exactly the same type of memory.

A single 16 bit wide DDR3 RAM chip requires about 40 FPGA pins and can transfer about 10-15Gb/s. (The RAMs can go much faster, but it's hard to get the soft FPGA controllers to work as well as the hard controllers in other devices.) OTOH, you can perhaps fit a dozen such RAM interfaces into a (large) FPGA, leading to up to 180Gb/s of data transfer.

That's a peak rate though. It can be sustained only with linear addressing of large blocks (for some value of "large"). I'm guessing your method of using multiple 16k FFTs requires the opposite of linear addressing, so this may not work out too well unless you're very careful with how you (re)order your RAM accesses.
> I am especially interested in a way where I can avoid parallel input.
Use serial input then. Many modern FPGAs have multiple multi-megabit transceivers. A protocol like 10Gb/s Ethernet could give you a "fill time" of less than 20ms.

Nice homework problem, BTW.

Regards, Allan
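The figures in this post are easy to sanity-check, taking "million" as 2**20 points, the 128-bit-per-complex-point guess, and a single 10 Gb/s link (all assumptions stated above):

```python
points = 1 << 20                      # "million-point", taken as 2**20
bits_per_point = 128                  # the 128-bit guess quoted above
storage_bits = points * bits_per_point

print(storage_bits // (1 << 20))      # 128 -> the "128Mb" figure

fill_time_ms = storage_bits / 10e9 * 1e3   # one 10 Gb/s serial link
print(round(fill_time_ms, 1))         # 13.4 -> consistent with "< 20ms"
```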
Reply by Martin Thompson December 12, 2012
"gongdori" <16548@dsprelated> writes:

> Hi all,
>
> I am interested in performing million-point FFT in FPGA.
Why in an FPGA specifically?

* For interest?
* or because you have analysis that says an FPGA is optimal for your particular set of constraints?

The first is good enough reason in itself. Otherwise, until you *need* an FPGA to meet some constraint or other, steer clear. (I speak as one who "does" FPGAs for a living.)
> I've read some papers and concluded that the best way of performing million-point FFT would be having multiple smaller size FFT in parallel, for example 8x 16K-point FFT, with twiddle factor multipliers (I think it will be radix-2^2 streaming FFT with multiple inputs). But, since I am a newbie in DSP, I want to check if there is any easier way. I am especially interested in a way where I can avoid parallel input.
Two calculations might be instructive:

* How big an FPGA do you need to do a 16K point FFT? You'll probably find yourself multiplier or RAM limited rather than logic limited, so the calculations are relatively easy.
* For your chosen FFT setup: how long will you spend shuffling data in and out of external memory?

FFTs in FPGA often turn into a memory-bandwidth-bound problem, as the memory/logic density of FPGAs doesn't match up to the cache/processing density of a high-end processor.

Cheers, Martin

-- martin.j.thompson@trw.com TRW Conekt - Consultancy in Engineering, Knowledge and Technology http://www.conekt.co.uk/capabilities/39-electronic-hardware
Reply by kevin December 12, 2012
gongdori <16548@dsprelated> wrote:
> I am interested in performing million-point FFT in FPGA.
You don't mention the word size you want to use, fixed or floating point, input speed, latency requirements, complex or real-only inputs, etc., so it's difficult to say: this is what you need to look at.

However, FFT architectures can be broadly categorized into 4 basic types: 1) single processor, 2) multi-processor, 3) pipeline (one processor for each butterfly column), or 4) systolic (one processor per butterfly). Within the 4 categories, there are a large number of variations. Rabiner and Gold's book: Theory and Application of Digital Signal Processing (Chapter 10), would be a good place to start.

Pipeline architectures can be very high-speed, while minimizing the input/output requirements. Data streams into one end, and, after some latency, streams out the other. Most architectures operate continuously, with no breaks on the input or output (ie: the first million data points to be transformed are followed by the second million to be transformed, etc.). Depending on the algorithm/architecture, bit-reversal can be done on the input, on the output, or within the processor by proper buffering.

A relatively straightforward design would be a 1024 point pipeline, followed by a million word buffer, followed by another 1024 point pipeline. Each 1024 point pipeline could be configured as 10 radix-2 stages, but there are a great many variations possible, depending on the requirements.

There are companies that produce FPGA designs for FFT pipelines, such as Dillon Engineering, and Annapolis Microsystems:

http://ecat.edacafe.com/corpprofile.php?vendor_id=9001925
http://www.annapmicro.com/aboutus.html

But I think you'll have to be more specific about your requirements before you proceed.

Kevin