DSPRelated.com
Forums

High performance matrix multiplication hardware

Started by Victor Suarez April 29, 2008
Hi.
I need to develop hardware for high-performance matrix-by-vector
multiplication.
Say, a hardware that calculates

y = Ax

where x and y are vectors and A is a matrix. The vectors have a few thousand
elements and A is a rectangular n*m matrix, with n and m both
around a few thousand. Elements are 16-bit, but an accumulator of
about 48 bits is needed. The vector x changes frequently,
but A doesn't (perhaps it can even be fixed in certain situations).
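A quick check on the accumulator width: each 16x16-bit product fits in 32 bits, and summing a few thousand of them (say 8192, so 13 guard bits) needs about 45 bits, so 48 is comfortable. A minimal Python reference model of the operation, with toy sizes and illustrative data:

```python
# Reference model of y = Ax with 16-bit elements and a 48-bit accumulator.
# Sizes and values here are toy placeholders for illustration only.
n, m = 4, 3                                   # real sizes are a few thousand

A = [[1000, -2000, 3000] for _ in range(n)]   # n x m matrix, int16-range values
x = [30000, 30000, 30000]                     # m-element vector, int16-range values

ACC_MAX = (1 << 47) - 1                       # signed 48-bit accumulator limit

y = []
for i in range(n):
    acc = 0
    for j in range(m):
        acc += A[i][j] * x[j]                 # each product fits in 32 bits
    assert -ACC_MAX - 1 <= acc <= ACC_MAX     # must fit the 48-bit accumulator
    y.append(acc)

print(y)
```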

The calculation must be done as fast as possible, on the order of
100 times a second. The bottleneck is clearly memory bandwidth. I
thought about using a 64-bit wide DDR2 SDRAM for A, a 16-bit wide memory for
x and a 16-bit memory for y. This architecture isn't very usual; for
example, standard development boards can't handle this problem very
well, because they have a single 16- or 32-bit SDRAM.

For example, if A is 4096x8192 and x is an 8192-element vector, more
than 6 GB/s of memory bandwidth is needed (for 100 calculations a
second) just to read A (which is about 64 MB in this case).
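That bandwidth figure can be checked directly from the sizes above:

```python
# Bandwidth needed just to stream A once per calculation, 100 times a second.
n, m = 4096, 8192
bytes_per_elem = 2                  # 16-bit elements
size_A = n * m * bytes_per_elem     # 64 MiB
rate = 100                          # full reads of A per second
bw = size_A * rate                  # bytes/s needed just for A

print(size_A // 2**20, "MiB,", bw / 1e9, "GB/s")
```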

The question is: what do you suggest, and what can I expect with
current technology? One basic question I have is how much memory
bandwidth I can expect in a real setup, for example using a low-cost
chip like the Spartan-3E (additionally, low overall production cost is
needed; I know that complicates things, but it's also a requirement).

Thanks a lot in advance.
Hi.
I'd appreciate more help with this.

My tentative architecture is as follows, but I don't know for sure if
it will work in practice:

1 FPGA, an XC3S200A-5FTG256C, which has:
195 I/O pins
16384x18-bit internal block RAM
16 units of 18x18-bit hardware multipliers
Price: about $22 (one chip)

8 units of SDRAM like:
MT47H16M16BG-5E (DDR2-400) or MT47H16M16BG-3 (DDR2-667), or
MT46V16M16P-5B (DDR-400)
They have a 16-bit data width each (total bus width: 128 bits).
What I have understood is that each chip can deliver data at 400 Mbit/s on
each data pin.
Price: about $10 each chip (similar prices all speed grades)

The mathematical operation to be done is:
y = Ax

where
A is a matrix of size 4096x8192x16-bit = 64 MByte (to store in DRAM)
x is a vector of size 8192x18-bit (to store in half of the FPGA block RAM)
y is a vector of size 4096x18-bit (to store in another quarter of the FPGA
block RAM)
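A quick sanity check of the block-RAM budget, a sketch using the figures above:

```python
# Block RAM budget on the XC3S200A, per the figures above: 16384 x 18-bit total.
bram_words = 16384
x_words = 8192                          # x: 8192 x 18-bit -> half of block RAM
y_words = 4096                          # y: 4096 x 18-bit -> another quarter

spare = bram_words - x_words - y_words  # words left over for buffering
print(spare)                            # remaining quarter, e.g. for rows of A
```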

The FPGA has 195 I/O pins; I need 128 for data and the rest for address and control.

Theoretical SDRAM memory bandwidth = 128 bits x 400 Mbit/s = 6400 MByte/s
memory bandwidth / memory size = (6400 MByte/s)/(64 MByte) = 100/s, so it
can do about 100 calculations/s.
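Working the supply side through in the same way, the theoretical figure comes out to roughly 95 full passes over A per second, before refresh and row-activation overhead is accounted for:

```python
# Supply side: 8 x16 DDR chips = 128 data pins, 400 Mbit/s per pin (DDR-400).
pins = 8 * 16
per_pin_bits = 400e6                # bits/s on each data pin
bw = pins * per_pin_bits / 8        # bytes/s -> 6.4e9

size_A = 4096 * 8192 * 2            # 64 MiB of 16-bit elements
passes = bw / size_A                # theoretical full reads of A per second
print(passes)
```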

In this setup I need to do 8 parallel multiply-accumulations at
400 MHz, but the multipliers on the FPGA can't run at this rate (they can
do about 250 MHz). Using two pipelined multipliers per lane, it seems
possible to overcome the problem (which would consume all 16 multipliers
on the FPGA). I don't know if there would be enough time for the
additional 48-bit accumulation. Do I need an FPGA designed for DSP (with
multiply-accumulate blocks), like the Spartan-3A DSP?
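Functionally, splitting one 400 MHz MAC stream across two slower multipliers just means alternating samples between two lanes and adding the two partial sums at the end. A toy Python sketch of that idea (illustrative data, not an HDL design):

```python
# Two-lane MAC: even-indexed products go to lane 0, odd-indexed to lane 1,
# so each lane runs at half the sample rate; the partial sums combine at the end.
a = [3, -1, 4, 1, 5, -9, 2, 6]      # one row of A (toy data)
x = [2, 7, 1, 8, 2, 8, 1, 8]        # the x vector (toy data)

acc0 = sum(a[j] * x[j] for j in range(0, len(a), 2))  # lane 0: even j
acc1 = sum(a[j] * x[j] for j in range(1, len(a), 2))  # lane 1: odd j
y_i = acc0 + acc1                    # combine the two partial accumulators

# The split must give the same answer as a single-lane MAC:
assert y_i == sum(p * q for p, q in zip(a, x))
print(y_i)
```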

I think block RAM can be used for storing the vectors, but (again) I
don't know for sure. Is there any problem with this? Do I need to put
this information in other external RAM chips?

Well, these are my thoughts. I need to verify that this architecture
makes sense, because if I pass this design to a hardware hacker to
make the board and then can't achieve more than about 60
calculations/s (60% of the theoretical maximum), I would be in trouble...

Is it better to use DDR-400 instead of DDR2-400? It seems that board
layout and other considerations are easier with DDR.

Other questions: what FPGA and DRAM clock rates are needed? What about
FPGA utilization?

Well, the question is: am I correct in my architecture and calculations?
Really thanks in advance.

Cheers,
Victor.