For engineers implementing DSP functions on FPGAs.
Hi.

I need to develop hardware for high-performance matrix-vector multiplication, that is, hardware that calculates

y = Ax

where x and y are vectors and A is a matrix. The vectors have a few thousand elements, and A is a rectangular n x m matrix with n and m each a few thousand. The elements are 16-bit, but an accumulator of about 48 bits is needed. x changes frequently but A does not (in some situations it may even be fixed).

The calculation has to run as fast as possible, on the order of 100 times per second. The bottleneck is clearly memory bandwidth. I thought about using a 64-bit wide DDR2 SDRAM for A, a 16-bit wide memory for x and a 16-bit wide memory for y. This architecture is unusual; standard development boards, for example, cannot handle this problem well because they have a single 16- or 32-bit SDRAM.

For example, if A is 4096x8192 and x is an 8192-element vector, then 100 calculations per second require more than 3 GB/s of memory bandwidth just to read A (A occupies about 64 MB in this case, and 64 MB x 100/s = 6400 MB/s).

My question is what you would suggest, and what I can expect from current technology. One basic question is how much memory bandwidth I can expect in a real setup using a low-cost chip such as the Spartan-3E. (Low overall production cost is also a requirement; I know it complicates things.)

Thanks a lot in advance.
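As a rough check of the sizing figures in this message, the Python sketch below computes the storage for A, the bandwidth needed to stream A 100 times per second, and an upper bound on the accumulator width. The function names are made up for illustration, and the accumulator bound assumes signed 16-bit operands, which the post does not state.

```python
import math


def matrix_bytes(rows: int, cols: int, elem_bits: int) -> int:
    """Storage needed for a dense rows x cols matrix, in bytes."""
    return rows * cols * elem_bits // 8


def read_bandwidth(total_bytes: int, passes_per_second: float) -> float:
    """Bytes per second needed to stream the whole matrix once per pass."""
    return total_bytes * passes_per_second


def accumulator_bits(elem_bits: int, terms: int) -> int:
    """Safe upper bound on the signed accumulator width for a sum of
    `terms` products of two signed elem_bits-wide operands (assumption:
    operands are signed two's complement)."""
    return 2 * elem_bits + math.ceil(math.log2(terms))


if __name__ == "__main__":
    MIB = 2 ** 20
    size = matrix_bytes(4096, 8192, 16)
    bw = read_bandwidth(size, 100)
    print(f"A occupies {size / MIB:.0f} MiB")
    print(f"Streaming A 100 times/s needs {bw / MIB:.0f} MiB/s "
          f"({bw / 1e9:.2f} GB/s in decimal units)")
    print(f"Accumulator bound for 8192 terms: {accumulator_bits(16, 8192)} bits")
```

Under these assumptions the bound comes out below 48 bits, so a 48-bit accumulator leaves some headroom for a row of 8192 products.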
Hi.

I'd appreciate more help with this. My tentative architecture is as follows, but I don't know for sure whether it will work in practice:

1 FPGA, XC3S200A-5FTG256C, which has:
  - 195 I/O pins
  - 16384 x 18-bit internal block RAM
  - 16 hardware multipliers, 18x18-bit
  - Price: about $22 (one chip)

8 SDRAM chips, such as MT47H16M16BG-5E (DDR2-400), MT47H16M16BG-3 (DDR2-667) or MT46V16M16P-5B (DDR-400):
  - 16-bit data width per chip (total bus width 128 bits)
  - As I understand it, each chip can deliver 400 Mbit/s on each data pin.
  - Price: about $10 per chip (similar prices for all speed grades)

The mathematical operation to be done is y = Ax, where:
  - A is a matrix of size 4096 x 8192 x 16 bits = 64 MByte (stored in DRAM)
  - x is a vector of size 8192 x 18 bits (stored in half of the FPGA block RAM)
  - y is a vector of size 4096 x 18 bits (stored in the third quarter of the FPGA block RAM)

The FPGA has 195 I/O pins; I need 128 for data and the others for address.

Theoretical SDRAM bandwidth = 128 bits x 400 Mbit/s per pin = 6400 MByte/s
Bandwidth / matrix size = (6400 MByte/s) / (64 MByte) = 100 per second, so it can do about 100 calculations/s.

In this setup I need to do 8 parallel multiply-accumulates at 400 MHz, but the multipliers in the FPGA cannot run at that rate (they reach about 250 MHz). Using two pipelined multipliers for each of the 8 products it seems possible to overcome the problem (which consumes all 16 multipliers in the FPGA). I don't know whether there would be enough time for the additional 48-bit accumulation. Do I need an FPGA designed for DSP (with multiply-accumulate blocks), like the Spartan-3A DSP?

I think block RAM can be used for storing the vectors, but I don't know that for sure either. Is there any problem with this? Do I need to put this data in separate external RAM chips?

Well, these are my thoughts.
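The budget above can be re-derived with a short Python sketch. It uses only the figures quoted in this message (128-bit bus, 400 Mbit/s per pin, about 250 MHz per multiplier, 16 multipliers); the function names are hypothetical, and the model ignores refresh, row activation and bus turnaround, so it gives an upper bound rather than a prediction.

```python
import math

# Figures taken from the message above.
BUS_BITS = 128          # 8 SDRAM chips x 16 data pins each
ELEM_BITS = 16          # width of one element of A
PIN_RATE_HZ = 400e6     # 400 Mbit/s per data pin = 400e6 bus transfers/s
MULT_MAX_HZ = 250e6     # approximate multiplier rate quoted in the message
MULTIPLIERS = 16        # 18x18 multipliers in the XC3S200A
ROWS, COLS = 4096, 8192


def lanes(bus_bits: int, elem_bits: int) -> int:
    """Number of matrix elements delivered by one bus transfer."""
    return bus_bits // elem_bits


def multipliers_per_lane(transfer_hz: float, mult_max_hz: float) -> int:
    """Multipliers to interleave per lane so each runs within its limit."""
    return math.ceil(transfer_hz / mult_max_hz)


def peak_passes_per_second(rows: int, cols: int, n_lanes: int,
                           transfer_hz: float) -> float:
    """Upper bound on y = Ax evaluations per second at 100% bus use."""
    transfers_per_pass = rows * cols / n_lanes
    return transfer_hz / transfers_per_pass


if __name__ == "__main__":
    n = lanes(BUS_BITS, ELEM_BITS)
    per_lane = multipliers_per_lane(PIN_RATE_HZ, MULT_MAX_HZ)
    print(f"lanes: {n}, multipliers needed: {n * per_lane} of {MULTIPLIERS}")
    print(f"each multiplier runs at {PIN_RATE_HZ / per_lane / 1e6:.0f} MHz")
    rate = peak_passes_per_second(ROWS, COLS, n, PIN_RATE_HZ)
    print(f"peak rate: {rate:.2f} calculations/s")
```

Note that the peak comes out slightly under 100 per second: the 6400 MByte/s figure uses decimal megabytes (128 bits x 400e6 per second), while 4096 x 8192 x 2 bytes is 64 binary megabytes.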
I need to verify that this architecture makes sense, because if I pass this design to a hardware hacker to build the board and then cannot achieve more than about 60 calculations/s (60% of the theoretical maximum), I will be in trouble...

Is it better to use DDR-400 instead of DDR2-400? It seems that board layout and other considerations are easier with DDR.

Other questions: What FPGA and DRAM clock rates are needed? What about FPGA utilization?

In short, the question is: are my architecture and calculations correct?

Many thanks in advance.

Cheers,
Victor.

On Tue, Apr 29, 2008 at 9:47 AM, <f...@yahoogroups.com> wrote:
> DSP & FPGA
> Messages In This Digest (1 Message)
> 1. High performance matrix multiplication hardware
>    Posted by: "Victor Suarez" s...@gmail.com manuko1977
>    Tue Apr 29, 2008 4:13 am (PDT)
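The worry about reaching only 60% of the theoretical maximum can be put in numbers with a small Python sketch. The bus efficiency here is a free parameter, not a measured value for any particular memory controller, and the function names are hypothetical.

```python
def achievable_rate(peak_bytes_per_s: float, matrix_bytes: int,
                    efficiency: float) -> float:
    """Calculations per second when only `efficiency` (0..1) of the
    peak bus bandwidth carries useful matrix data."""
    return efficiency * peak_bytes_per_s / matrix_bytes


def required_efficiency(target_rate: float, peak_bytes_per_s: float,
                        matrix_bytes: int) -> float:
    """Fraction of peak bandwidth needed to hit target_rate.
    A result above 1.0 means the bus is too slow even if never idle."""
    return target_rate * matrix_bytes / peak_bytes_per_s


if __name__ == "__main__":
    PEAK = 128 // 8 * 400e6        # 128-bit bus at 400 Mbit/s per pin
    A_BYTES = 4096 * 8192 * 2      # 16-bit elements

    for eff in (0.6, 0.8, 1.0):    # placeholder efficiencies, not measurements
        print(f"efficiency {eff:.0%}: "
              f"{achievable_rate(PEAK, A_BYTES, eff):.1f} calculations/s")

    need = required_efficiency(100, PEAK, A_BYTES)
    print(f"efficiency needed for 100/s: {need:.3f}")
```

Because the last figure is slightly above 1.0, a hard requirement of 100 calculations/s would need a somewhat faster or wider memory interface than 128 bits at 400 Mbit/s per pin, whatever efficiency the controller achieves.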