Reply by glen herrmannsfeldt December 13, 2012
robert bristow-johnson <rbj@audioimagination.com> wrote:

(snip)
>>> one idea to consider, if you're already comfortable with fixed-point FFT, is block-floating-point.
>>> your FFT butterfly will need two modes: one where the butterfly is the normal butterfly and the other where all of the results, real and imaginary for both outputs of the butterfly are divided by 2 (by use of an arithmetic shift right).
>> Done right, that should work pretty well in the FPGA.
> it's not that much different than fixed point.
(snip, I wrote)
>> You should be able to factor out any exponent before the data goes in, so all you need to do is count the shifts.
> well that's exactly what i intended to say. but this block could have an exponent going into the FFT. each time you scale the whole block by 1/2 (by a right shift of 1 bit) you increment that common exponent by 1. that's the same as counting the shifts. the exponent could be set to zero before beginning if that's what you want to do. or it could be set to whatever it was meant to be set to by the magnitude of the data inside. that block data should be normalized so that it has headroom of 12 to 18 dB. that is, it should be at least 1/8 fullscale and no more than 1/4 fullscale. two guard bits on the left.
Well, the suggestion was to not waste bits on what is easy to do outside the FPGA, but yes, you could run all the bits through.
>> Though that gets a little more complicated when you do things in parallel like the OP wants.
> what's in parallel? different parts of the pass? if that's the case, it's not any worse. all parallel butterflies should be in the same scaling mode and the sticky bits of each parallel section can be ORed together at the end of a pass.
Well, it seems that the OP wants to combine 32K point blocks together. So, instead of one or two butterflies, you could do it like 15 butterflies at once.
> i think you gotta finish one pass before starting the next.
(snip)
>> With the 18 bit block multipliers, bit widths like 18 and 35 are good choices.
> i wouldn't know what FPGA provide. 35 bits sounds good enough for good twiddle factors. can you easily do a 35x35 signed multiplication with these 18-bit block multipliers?
The low 17 bits of a signed multiplier should work just like an unsigned multiplier. I think you can use that for the low half, and 18 bit signed for the high half (well, not quite half).
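The split glen describes can be checked behaviorally: cut each 35-bit signed operand into an 18-bit signed high part and a 17-bit unsigned low part, form four partial products (each of which fits an 18x18 signed multiplier once the unsigned 17-bit operands are zero-extended into the 18th bit), and recombine. This is a minimal Python sketch of the decomposition, not a claim about any particular DSP block:

```python
def split35(x):
    """Split a 35-bit signed value: x == hi * 2**17 + lo, where hi is an
    18-bit signed part and lo is the low 17 bits taken as unsigned."""
    lo = x & 0x1FFFF          # low 17 bits, treated as unsigned
    hi = (x - lo) >> 17       # arithmetic shift; fits 18-bit signed
    return hi, lo

def mul35(a, b):
    """35x35 signed multiply from four partial products, each within an
    18x18 signed multiplier's range."""
    ah, al = split35(a)
    bh, bl = split35(b)
    return ((ah * bh) << 34) + ((ah * bl + al * bh) << 17) + (al * bl)
```

In hardware the three shifted sums become adders after the block multipliers; the point of the split is that no partial product needs more than 18 signed bits per operand.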
> i dunno squat about the particulars of the FPGA. i figgered, you can have the multiplier be as wide as you want, given enough gates and a good library to define it.
Yes, you can do that, but it would take a lot of gates. Because multipliers are common enough, they put in block multipliers. Much more efficient in gate usage, unless you don't need to do any multiplication. There are also block RAMs that are more efficient than using LUTs for RAM.

-- glen
Reply by robert bristow-johnson December 13, 2012
On 12/12/12 6:12 PM, glen herrmannsfeldt wrote:
> robert bristow-johnson <rbj@audioimagination.com> wrote:
>> On 12/12/12 12:05 PM, gongdori wrote:
>>> 3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.
>> one idea to consider, if you're already comfortable with fixed-point FFT, is block-floating-point.
>> your FFT butterfly will need two modes: one where the butterfly is the normal butterfly and the other where all of the results, real and imaginary for both outputs of the butterfly are divided by 2 (by use of an arithmetic shift right).
> Done right, that should work pretty well in the FPGA.
it's not that much different than fixed point.
> ...
>> for the latter mode, i dunno what you would do about the bit that falls off because there might be problems if you always round down. you might have to round up/down as a random outcome. maybe not.
> I don't know about that one, either.
>> so some entire passes are done in the normal mode and some entire passes are done in the divide-by-2 mode. every pass that the divide-by-2 mode is used, there is a single exponent for the whole block that you increment by 1. when the normal mode pass is used, then the exponent is left alone.
> You should be able to factor out any exponent before the data goes in, so all you need to do is count the shifts.
well that's exactly what i intended to say. but this block could have an exponent going into the FFT. each time you scale the whole block by 1/2 (by a right shift of 1 bit) you increment that common exponent by 1. that's the same as counting the shifts. the exponent could be set to zero before beginning if that's what you want to do. or it could be set to whatever it was meant to be set to by the magnitude of the data inside. that block data should be normalized so that it has headroom of 12 to 18 dB. that is, it should be at least 1/8 fullscale and no more than 1/4 fullscale. two guard bits on the left.
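That normalization (peak held between 1/8 and 1/4 of fullscale, with the common exponent absorbing every shift so the represented value never changes) sketches out in a few lines of Python. The 16-bit word width is an assumption:

```python
def normalize_block(x, width=16):
    """Scale a fixed-point block so its peak lies in [1/8, 1/4) of full
    scale, i.e. two guard bits of headroom. The represented values are
    preserved as x[i] * 2**exponent."""
    fullscale = 1 << (width - 1)
    exponent = 0
    # too hot: arithmetic shift right, bump the common exponent
    while max(abs(v) for v in x) >= fullscale >> 2:
        x = [v >> 1 for v in x]
        exponent += 1
    # too quiet: shift left, drop the exponent (skip an all-zero block)
    while 0 < max(abs(v) for v in x) < fullscale >> 3:
        x = [v << 1 for v in x]
        exponent -= 1
    return x, exponent
```

The same invariant (value = mantissa * 2**exponent) is what the per-pass divide-by-2 maintains inside the FFT itself.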
> Though that gets a little more complicated when you do things in parallel like the OP wants.
what's in parallel? different parts of the pass? if that's the case, it's not any worse. all parallel butterflies should be in the same scaling mode and the sticky bits of each parallel section can be ORed together at the end of a pass. i think you gotta finish one pass before starting the next.
>> to decide which mode you use, assume 12 dB of headroom to start with. the input data is "normalized" so that no sample exceeds 1/4 full scale. it sounds like 2 bits of waste in each fixed-point sample, but you need those for guard bits. then, for each FFT pass, you need a "sticky bit" (that is cleared before the pass begins) that detects if any result has a word that exceeds that 1/4 fullscale limit. if any one word, either real or imag part, exceeds that 1/4 fullscale threshold (in abs value), you set the sticky bit.
> With the 18 bit block multipliers, bit widths like 18 and 35 are good choices.
i wouldn't know what FPGA provide. 35 bits sounds good enough for good twiddle factors. can you easily do a 35x35 signed multiplication with these 18-bit block multipliers? i dunno squat about the particulars of the FPGA. i figgered, you can have the multiplier be as wide as you want, given enough gates and a good library to define it.

-- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
Reply by glen herrmannsfeldt December 12, 2012
robert bristow-johnson <rbj@audioimagination.com> wrote:
> On 12/12/12 12:05 PM, gongdori wrote:
>> 3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.
> one idea to consider, if you're already comfortable with fixed-point FFT, is block-floating-point.
> your FFT butterfly will need two modes: one where the butterfly is the normal butterfly and the other where all of the results, real and imaginary for both outputs of the butterfly are divided by 2 (by use of an arithmetic shift right).
Done right, that should work pretty well in the FPGA. The problem with floating point in an FPGA is that pre and post normalization takes a lot of logic. A little better in the newer families with 6LUTs instead of 4LUTs, but it is still big. Well, that is for add. For multiply and divide, it is much easier, though divide itself, that is, a combinatorial divider, is big. But if you do this one right, it should only have (or not) one shift after each stage.
> for the latter mode, i dunno what you would do about the bit that falls off because there might be problems if you always round down. you might have to round up/down as a random outcome. maybe not.
I don't know about that one, either.
> so some entire passes are done in the normal mode and some entire passes are done in the divide-by-2 mode. every pass that the divide-by-2 mode is used, there is a single exponent for the whole block that you increment by 1. when the normal mode pass is used, then the exponent is left alone.
You should be able to factor out any exponent before the data goes in, so all you need to do is count the shifts. Though that gets a little more complicated when you do things in parallel like the OP wants.
> to decide which mode you use, assume 12 dB of headroom to start with. the input data is "normalized" so that no sample exceeds 1/4 full scale. it sounds like 2 bits of waste in each fixed-point sample, but you need those for guard bits. then, for each FFT pass, you need a "sticky bit" (that is cleared before the pass begins) that detects if any result has a word that exceeds that 1/4 fullscale limit. if any one word, either real or imag part, exceeds that 1/4 fullscale threshold (in abs value), you set the sticky bit.
With the 18 bit block multipliers, bit widths like 18 and 35 are good choices.
> if that sticky bit is set at the beginning of the pass, then all of the butterflies for the upcoming pass are in divide-by-2 mode and the sticky bit is cleared. otherwise the butterflies are in normal mode and the sticky bit is already cleared at the beginning of the pass.
> admittedly this is identical in concept to what the Mot (now Freescale) DSP56K does. which is where i drew it conceptually.
> bit for bit, 32-bit fixed point is better than 32-bit IEEE floating point unless your headroom needs to be very large (40 dB or more). those 8 bits of exponent can be spent more effectively on mantissa.
-- glen
Reply by robert bristow-johnson December 12, 2012
On 12/12/12 12:05 PM, gongdori wrote:
> 3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.
one idea to consider, if you're already comfortable with fixed-point FFT, is block-floating-point.

your FFT butterfly will need two modes: one where the butterfly is the normal butterfly and the other where all of the results, real and imaginary for both outputs of the butterfly, are divided by 2 (by use of an arithmetic shift right). for the latter mode, i dunno what you would do about the bit that falls off because there might be problems if you always round down. you might have to round up/down as a random outcome. maybe not.

so some entire passes are done in the normal mode and some entire passes are done in the divide-by-2 mode. every pass that the divide-by-2 mode is used, there is a single exponent for the whole block that you increment by 1. when the normal mode pass is used, then the exponent is left alone.

to decide which mode you use, assume 12 dB of headroom to start with. the input data is "normalized" so that no sample exceeds 1/4 full scale. it sounds like 2 bits of waste in each fixed-point sample, but you need those for guard bits. then, for each FFT pass, you need a "sticky bit" (that is cleared before the pass begins) that detects if any result has a word that exceeds that 1/4 fullscale limit. if any one word, either real or imag part, exceeds that 1/4 fullscale threshold (in abs value), you set the sticky bit.

if that sticky bit is set at the beginning of the pass, then all of the butterflies for the upcoming pass are in divide-by-2 mode and the sticky bit is cleared. otherwise the butterflies are in normal mode and the sticky bit is already cleared at the beginning of the pass.

admittedly this is identical in concept to what the Mot (now Freescale) DSP56K does. which is where i drew it conceptually.

bit for bit, 32-bit fixed point is better than 32-bit IEEE floating point unless your headroom needs to be very large (40 dB or more). those 8 bits of exponent can be spent more effectively on mantissa.
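The pass-by-pass scheme above can be sketched behaviorally in Python (rather than VHDL). The 16-bit Q15 word width, the truncating shift, and doing the sticky-bit compare on every butterfly output are assumptions of this sketch; a real design would pipeline the butterflies and round the dropped bit as discussed:

```python
import cmath

FULLSCALE = 1 << 15        # assumed 16-bit words; two guard bits
QUARTER = FULLSCALE >> 2   # the 1/4-fullscale sticky-bit threshold

def bfp_fft(re, im):
    """In-place radix-2 DIT FFT with block-floating-point scaling.
    Returns (re, im, exponent); the spectrum is value * 2**exponent."""
    n = len(re)
    j = 0                          # bit-reverse permutation of the input
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]
    exponent = 0
    sticky = any(abs(v) > QUARTER for v in re + im)  # input headroom check
    size = 2
    while size <= n:               # one "pass" per butterfly column
        shift = 1 if sticky else 0     # whole pass in one scaling mode
        exponent += shift
        sticky = False
        for start in range(0, n, size):
            for k in range(size // 2):
                w = cmath.exp(-2j * cmath.pi * k / size)
                i1, i2 = start + k, start + k + size // 2
                tr = int(round(w.real * re[i2] - w.imag * im[i2]))
                ti = int(round(w.real * im[i2] + w.imag * re[i2]))
                outs = [(re[i1] + tr) >> shift, (im[i1] + ti) >> shift,
                        (re[i1] - tr) >> shift, (im[i1] - ti) >> shift]
                re[i1], im[i1], re[i2], im[i2] = outs
                if any(abs(v) > QUARTER for v in outs):
                    sticky = True  # next pass runs in divide-by-2 mode
        size *= 2
    return re, im, exponent
```

Parallel butterflies would each keep a local sticky bit, ORed together at the end of the pass, exactly as discussed above.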
-- r b-j rbj@audioimagination.com "Imagination is more important than knowledge."
Reply by glen herrmannsfeldt December 12, 2012
gongdori <16548@dsprelated> wrote:

(snip)
>>> I am interested in performing million-point FFT in FPGA.
(snip)
>> 1M points would require 128Mb of storage, if using 128 bits for each complex point. You didn't specify a SFDR or other metric which would allow the number of bits to be established. I guessed at 128.
(snip)
>>> I am especially interested in a way where I can avoid parallel input.
Well, you can avoid parallel input by separating it out once it gets into the FPGA.

(snip)
> Thank you all for your input. I really appreciate it. Let me answer some of your questions:
> 1) I think about using FPGA to process big big data in shorter time. And, I'm thinking about using Xilinx Virtex-7 part (VC707, if you are familiar with FPGA).
Yes, but one million points isn't so big in NlogN terms, and you don't say how short a time is needed.
> 2) The FFT architecture I am thinking of is Streaming Radix-2^2 architecture which can work on a stream of data. However, since I have much more resource than what a single 32-k streaming FFT requires, I thought about using multiple FFTs for parallel input (also the math was shown in somebody's paper I found online).
Well, the whole idea behind FFT is that you can combine (usually) two FFTs of length N and get one of length 2N. Repeat that the appropriate number of times, and you get the O(NlogN). You should understand how that happens inside the algorithm, even if only for a fairly small N. You might write down a diagram showing how two 4 point FFTs are combined to make 8 points, then consider doing the two in parallel and combining the outputs.

Then you should understand the difference between decimation-in-time and decimation-in-frequency. Either works, but there is a difference in the order you need the data points. (You can even mix them, if that turns out to be better.)
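The combine-two-length-N-FFTs-into-length-2N step is small enough to write down; this is a decimation-in-time sketch in Python (the recursive wrapper is only for illustration; in hardware the two half-size transforms would run on parallel blocks and `combine` would be the final twiddle-and-butterfly column):

```python
import cmath

def combine(even, odd):
    """Merge two length-N FFTs into one length-2N FFT (decimation in
    time): twiddle the odd-half outputs, then one butterfly column."""
    n = len(even)
    out = [0j] * (2 * n)
    for k in range(n):
        t = cmath.exp(-2j * cmath.pi * k / (2 * n)) * odd[k]
        out[k] = even[k] + t
        out[k + n] = even[k] - t
    return out

def fft(x):
    """Recursive wrapper: split into even/odd samples, transform the
    halves, combine. len(x) must be a power of two."""
    if len(x) == 1:
        return list(x)
    return combine(fft(x[0::2]), fft(x[1::2]))
```

Note the data-ordering point made above: decimation in time wants the even- and odd-indexed samples routed to different half-size transforms, which is exactly where the input-ordering choices for parallel blocks come from.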
> 3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.
Yes fixed point is a better choice. You can even do it much wider than the significand of floating point would be, and use less logic. That is, compare 32 bit fixed point with 32 significant bits, to 32 bit floating point with 24 significant bits. Now, it gets more interesting when you have 18x18 block multipliers. (I believe signed, which complicates combining them.) Some bit widths will be much easier than others.
> 4) I have a plan to use PCIe interface to move data in and out...
> I have written streaming, and parallel FFTs in VHDL and thinking about how to mix the two approaches to maximize the throughput and minimize the latency of the design... But, if you have any other bright idea, or a good reference, please let me know!
If you can control the order of the input data, then there are many interesting things you can do, especially as combinations of decimation in time and frequency. If you have N FFT blocks running in parallel, you can fill one block before going to the next one, or send every Nth input to an FFT block. The way you combine the outputs, and also the order that the data comes out, will be different.

How many million (is it really 1000000 or is it 2**20?) point FFTs do you need, and how fast do you need them?

-- glen
Reply by gongdori December 12, 2012
> On Tue, 11 Dec 2012 07:20:07 -0600, gongdori wrote:
>> Hi all,
>> I am interested in performing million-point FFT in FPGA. I've read some papers and concluded that the best way of performing million-point FFT would be having multiple smaller size FFT in parallel, for example 8x 16K-point FFT, with twiddle factor multipliers (I think it will be radix-2^2 streaming FFT with multiple inputs). But, since I am a newbie in DSP, I want to check if there is any easier way. I am especially interested in a way where I can avoid parallel input.
>> Could you give me your idea/comment? Thanks!
>> Thomas
> 1M points would require 128Mb of storage, if using 128 bits for each complex point. You didn't specify a SFDR or other metric which would allow the number of bits to be established. I guessed at 128.
> The largest FPGAs in the newest families from the two leading FPGA vendors have onboard storage for about 50-60Mb. These largest parts cost thousands of dollars each. They will not fit all your points in onboard RAM.
> Your 16k point suggestion would require 2Mb, again assuming 128 bits per complex point. Suitably sized FPGAs for 16k points are much cheaper, but you would need an external memory to hold all 1M points.
> The disadvantage of external memory is that it becomes the limiting factor in the throughput. In that sense, an FPGA won't do any better than a (cheaper and easier-to-program) DSP chip which can use exactly the same type of memory.
> A single 16 bit wide DDR3 RAM chip requires about 40 FPGA pins and can transfer about 10-15Gb/s. (The RAMs can go much faster, but it's hard to get the soft FPGA controllers to work as well as the hard controllers in other devices.)
> OTOH, you can perhaps fit a dozen such RAM interfaces into a (large) FPGA, leading to up to 180Gb/s of data transfer. That's a peak rate though. It can be sustained only with linear addressing of large blocks (for some value of "large"). I'm guessing your method of using multiple 16k FFTs requires the opposite of linear addressing, so this may not work out too well unless you're very careful with how you (re)order your RAM accesses.
>> I am especially interested in a way where I can avoid parallel input.
> Use serial input then. Many modern FPGAs have multiple multi-megabit transceivers. A protocol like 10Gb/s Ethernet could give you a "fill time" of less than 20ms.
> Nice homework problem, BTW.
> Regards, Allan
Thank you all for your input. I really appreciate it. Let me answer some of your questions:

1) I think about using FPGA to process big big data in shorter time. And, I'm thinking about using Xilinx Virtex-7 part (VC707, if you are familiar with FPGA).

2) The FFT architecture I am thinking of is Streaming Radix-2^2 architecture which can work on a stream of data. However, since I have much more resource than what a single 32-k streaming FFT requires, I thought about using multiple FFTs for parallel input (also the math was shown in somebody's paper I found online).

3) floating point vs. fixed point - ah... This is what has been pulling my hair. In my opinion, fixed point is good enough and it is much more suitable in FPGA. I've seen some floating point libraries and cores, it takes multiple clock cycles per operation. I've written VHDL code which does n-point FFT and thinking about modifying it to handle floating point, but I haven't done it yet.

4) I have a plan to use PCIe interface to move data in and out...

I have written streaming, and parallel FFTs in VHDL and thinking about how to mix the two approaches to maximize the throughput and minimize the latency of the design... But, if you have any other bright idea, or a good reference, please let me know!

Thanks for giving me the link for Dillon and ams. I'll go through it soon! Have a nice day!

Thomas
Reply by Tim Wescott December 12, 2012
On Wed, 12 Dec 2012 04:42:06 +0000, glen herrmannsfeldt wrote:

> Tim Wescott <tim@seemywebsite.please> wrote:
> (snip)
>>> Then Tim Wescott <tim@seemywebsite.com> wrote:
>>>> For certain problems -- and this one 'feels' like a candidate -- doing math on an FPGA can be less expensive in terms of BOM cost, space consumed by hardware, and/or power consumption.
> (snip, then I wrote)
>>> Certainly that is true. But then again, you can always use a soft processor on the FPGA.
>>> The OP didn't say fixed or floating point, which also might matter.
>>> It could also be a school assignment to do something on an FPGA, and possibly with the requirement that it not be in a soft processor.
> (snip)
>> On the whole, I think a soft processor would be slower, bigger, hotter, and more expensive than a real one.
> Yes, but the FPGA has the advantage that you can put much of the other logic that you need inside, also.
>> I was thinking more on the lines that if the task entailed slurping in 10^6 data points, doing an FFT, then spewing out 10^6 results as fast as possible, that an FPGA might (most certainly might, not "would") be the answer.
> The "as fast as possible" was not in the original post. For fixed point, I would be pretty sure that the FPGA would be a good choice. I would be a lot less sure for floating point.
True. I tend to think that if someone wants to use an FPGA for a job, sans processor, that the "as fast as possible" is implied. But that's often wrong.
> Many FPGA families include an 18x18 block multiplier. If 18 bits is enough, then the limit is mostly how fast you can stuff numbers through the multipliers, though a little bit moving them around.
Even if 18x18 isn't enough, the multiplier would help a lot with larger word lengths.
> I haven't gone through the bit-reversed order algorithm recently, but I believe that if you can get them in that order it works pretty well. If not, then you have to store them somewhere, and though FPGAs do have a lot of memory now, maybe not that much.
They don't have that much memory yet, to my knowledge. One would need an FPGA with RAM hanging off of it. Of course, this would still present interesting challenges if the data needs to be reordered and the RAM favors burst transfers (but then, I do recall seeing an algorithm for an FFT to be performed on data stored on tape that keeps the data in order -- this would work well for an FFT on data stored in SDRAM and read out in blocks, fast).
>> Of course, if the assignment is coming from a prof, then that's all she wrote.
> Yes. Maybe the OP will explain more.
I assumed a lot in my responses. It would be an interesting problem if an FPGA was actually demanded for technical reasons, and not just because it's required for pedagogical reasons or because someone doesn't understand what an FPGA is good for.

-- Tim Wescott Control system and signal processing consulting www.wescottdesign.com
Reply by Allan Herriman December 12, 2012
On Tue, 11 Dec 2012 07:20:07 -0600, gongdori wrote:

> Hi all,
>
> I am interested in performing million-point FFT in FPGA. I've read some papers and concluded that the best way of performing million-point FFT would be having multiple smaller size FFT in parallel, for example 8x 16K-point FFT, with twiddle factor multipliers (I think it will be radix-2^2 streaming FFT with multiple inputs). But, since I am a newbie in DSP, I want to check if there is any easier way. I am especially interested in a way where I can avoid parallel input.
>
> Could you give me your idea/comment? Thanks!
>
> Thomas
1M points would require 128Mb of storage, if using 128 bits for each complex point. You didn't specify a SFDR or other metric which would allow the number of bits to be established. I guessed at 128.

The largest FPGAs in the newest families from the two leading FPGA vendors have onboard storage for about 50-60Mb. These largest parts cost thousands of dollars each. They will not fit all your points in onboard RAM.

Your 16k point suggestion would require 2Mb, again assuming 128 bits per complex point. Suitably sized FPGAs for 16k points are much cheaper, but you would need an external memory to hold all 1M points.

The disadvantage of external memory is that it becomes the limiting factor in the throughput. In that sense, an FPGA won't do any better than a (cheaper and easier-to-program) DSP chip which can use exactly the same type of memory.

A single 16 bit wide DDR3 RAM chip requires about 40 FPGA pins and can transfer about 10-15Gb/s. (The RAMs can go much faster, but it's hard to get the soft FPGA controllers to work as well as the hard controllers in other devices.) OTOH, you can perhaps fit a dozen such RAM interfaces into a (large) FPGA, leading to up to 180Gb/s of data transfer.

That's a peak rate though. It can be sustained only with linear addressing of large blocks (for some value of "large"). I'm guessing your method of using multiple 16k FFTs requires the opposite of linear addressing, so this may not work out too well unless you're very careful with how you (re)order your RAM accesses.
> I am especially interested in a way where I can avoid parallel input.
Use serial input then. Many modern FPGAs have multiple multi-megabit transceivers. A protocol like 10Gb/s Ethernet could give you a "fill time" of less than 20ms.

Nice homework problem, BTW.

Regards, Allan
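The figures in this post are easy to sanity-check, taking "million" as 2**20 points, the 128-bit-per-complex-point guess, and a single 10 Gb/s link (all assumptions stated above):

```python
points = 1 << 20                      # "million-point", taken as 2**20
bits_per_point = 128                  # the 128-bit guess quoted above
storage_bits = points * bits_per_point

print(storage_bits // (1 << 20))      # 128 -> the "128Mb" figure

fill_time_ms = storage_bits / 10e9 * 1e3   # one 10 Gb/s serial link
print(round(fill_time_ms, 1))         # 13.4 -> consistent with "< 20ms"
```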
Reply by Martin Thompson December 12, 2012
"gongdori" <16548@dsprelated> writes:

> Hi all,
>
> I am interested in performing million-point FFT in FPGA.
Why in an FPGA specifically?

* For interest?
* or because you have analysis that says an FPGA is optimal for your particular set of constraints?

The first is good enough reason in itself. Otherwise, until you *need* an FPGA to meet some constraint or other, steer clear. (I speak as one who "does" FPGAs for a living.)
> I've read some papers and concluded that the best way of performing million-point FFT would be having multiple smaller size FFT in parallel, for example 8x 16K-point FFT, with twiddle factor multipliers (I think it will be radix-2^2 streaming FFT with multiple inputs). But, since I am a newbie in DSP, I want to check if there is any easier way. I am especially interested in a way where I can avoid parallel input.
Two calculations might be instructive:

* How big an FPGA do you need to do a 16K point FFT? You'll probably find yourself multiplier or RAM limited rather than logic limited, so the calculations are relatively easy.
* For your chosen FFT setup: how long will you spend shuffling data in and out of external memory?

FFTs in FPGA often turn into a memory-bandwidth-bound problem, as the memory/logic density of FPGAs doesn't match up to the cache/processing density of a high-end processor.

Cheers, Martin

-- martin.j.thompson@trw.com TRW Conekt - Consultancy in Engineering, Knowledge and Technology http://www.conekt.co.uk/capabilities/39-electronic-hardware
Reply by kevin December 12, 2012
gongdori <16548@dsprelated> wrote:
> I am interested in performing million-point FFT in FPGA.
You don't mention the word size you want to use, fixed or floating point, input speed, latency requirements, complex or real-only inputs, etc., so it's difficult to say: this is what you need to look at.

However, FFT architectures can be broadly categorized into 4 basic types: 1) single processor, 2) multi-processor, 3) pipeline (one processor for each butterfly column), or 4) systolic (one processor per butterfly). Within the 4 categories, there are a large number of variations. Rabiner and Gold's book: Theory and Application of Digital Signal Processing (Chapter 10), would be a good place to start.

Pipeline architectures can be very high-speed, while minimizing the input/output requirements. Data streams into one end, and, after some latency, streams out the other. Most architectures operate continuously, with no breaks on the input or output (ie: the first million data points to be transformed are followed by the second million to be transformed, etc.). Depending on the algorithm/architecture, bit-reversal can be done on the input, on the output, or within the processor by proper buffering.

A relatively straightforward design would be a 1024 point pipeline, followed by a million word buffer, followed by another 1024 point pipeline. Each 1024 point pipeline could be configured as 10 radix-2 stages, but there are a great many variations possible, depending on the requirements.

There are companies that produce FPGA designs for FFT pipelines, such as Dillon Engineering, and Annapolis Microsystems:

http://ecat.edacafe.com/corpprofile.php?vendor_id=9001925
http://www.annapmicro.com/aboutus.html

But I think you'll have to be more specific about your requirements before you proceed.

Kevin