Hi.

I need to develop hardware for high-performance matrix-vector

multiplication. Say, hardware that calculates

y = Ax

where x and y are vectors and A is a matrix. The vectors have a few
thousand elements, and A is a rectangular n*m matrix, with n and m
both on the order of a few thousand. The element data type is 16-bit,
but an accumulator of about 48 bits is needed. The vector x changes
frequently, but A doesn't (in some situations it could even be fixed).

The calculation needs to be done as fast as possible, on the order of
100 times a second. The bottleneck is clearly memory bandwidth. I
thought about using a 64-bit-wide DDR2 SDRAM for A, a 16-bit-wide
memory for x, and a 16-bit memory for y. This architecture isn't very
common; standard development boards, for example, can't handle this
problem very well because they carry a single 16- or 32-bit SDRAM.

For example, if A is 4096x8192 and x is an 8192-element vector, more
than 6 GB/s of memory bandwidth is needed (for 100 calculations a
second) just to read A (which occupies about 64 MB in this case).
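Just to pin down that figure, here is a back-of-the-envelope check in Python (assuming the 4096x8192 matrix of 16-bit elements is read once per calculation, as above):

```python
n, m = 4096, 8192        # matrix dimensions
elem_bytes = 2           # 16-bit elements
rate = 100               # calculations per second

a_bytes = n * m * elem_bytes          # size of A in bytes
bw = a_bytes * rate                   # bytes/s just to stream A

print(a_bytes // 2**20)               # 64 (MiB)
print(bw / 2**30)                     # 6.25 (GiB/s)
```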

The question is: what do you suggest, and what can I expect with
current technology? One basic question I have is how much memory
bandwidth I can expect in a real setup, for example using a low-cost
chip like the Spartan-3E. (Additionally, low overall production cost
is needed; I know that complicates things, but it's also a requirement.)

Thanks a lot in advance.

# High performance matrix multiplication hardware

Started by ●April 29, 2008

Reply by ●April 30, 2008

Hi.

I'd appreciate more help with this.

My tentative architecture is as follows, but I don't know for sure
whether it will work in practice:

1 FPGA, an XC3S200A-5FTG256C, which has:

195 I/O pins

16384x18-bit internal block RAM

16 dedicated 18x18-bit hardware multipliers

Price: about $22 (one chip)

8 SDRAM chips, such as:

MT47H16M16BG-5E (DDR2-400) or MT47H16M16BG-3 (DDR2-667), or

MT46V16M16P-5B (DDR-400)

Each chip has a 16-bit data width (total bus width: 128 bits).

What I have understood is that each chip can deliver data at
400 Mbit/s on each data pin.

Price: about $10 per chip (similar prices across all speed grades)

The mathematical operation to be done is:

y = Ax

where

A is a matrix of size 4096x8192x16-bit = 64 MByte (to be stored in DRAM)

x is a vector of size 8192x18-bit (to be stored in half of the FPGA block RAM)

y is a vector of size 4096x18-bit (to be stored in another quarter of
the FPGA block RAM)
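As a quick check that both vectors fit in the 16384x18-bit block RAM total (a sketch; the assumption here, not stated in the post, is that only the final truncated 18-bit results of y go to block RAM, while the 48-bit running accumulators live in fabric registers):

```python
# Block RAM budget on the XC3S200A: 16384 words x 18 bits in total.
bram_words = 16384
x_words = 8192          # x: one 18-bit word per element
y_words = 4096          # y: final (truncated) results, one word per element

spare = bram_words - (x_words + y_words)
assert spare >= 0       # both vectors fit
print(spare)            # 4096 words left over for buffers
```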

The FPGA has 195 I/O pins; I need 128 for data and the rest for
address and control.

Theoretical SDRAM memory bandwidth = 128 bits x 400 Mbit/s = 6400 MByte/s

memory bandwidth / memory size = (6400 MByte/s) / (64 MByte) = 100,
so about 100 calculations/s should be possible
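To double-check that division (note the quotient comes out slightly under 100 once A's 64 MiB is counted in binary units, and DRAM refresh plus row activations will cost more on top of that):

```python
bus_bits = 128            # 8 chips x 16 data pins
pin_rate = 400e6          # bits/s per pin at DDR(2)-400
a_bytes = 4096 * 8192 * 2 # matrix A in bytes (64 MiB)

peak_bw = bus_bits * pin_rate / 8     # peak bytes/s = 6.4e9
calcs = peak_bw / a_bytes             # full passes over A per second
print(round(calcs, 1))                # 95.4
```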

In this setup I need to do 8 parallel multiply-accumulates at 400 MHz,
but the multipliers on the FPGA can't run at this rate (they manage
about 250 MHz). Using two pipelined multipliers per lane, it seems
possible to overcome the problem (which then consumes all 16
multipliers on the FPGA). I don't know whether there would be enough
time for the additional 48-bit accumulation. Do I need an FPGA
designed for DSP (with multiply-accumulate blocks), like the
Spartan-3A DSP?
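A behavioral golden model of that datapath may help verify the scheme before writing RTL. This is my sketch, not from the post: 8 lanes, each accumulating into 48 bits with wrap-around, consuming one 128-bit DRAM word (eight 16-bit matrix elements) per "cycle"; in hardware each lane would time-interleave its two multipliers to reach the 400 MHz data rate.

```python
import random

MASK48 = (1 << 48) - 1   # model the 48-bit accumulator width

def matvec_8lane(A, x):
    """Multiply 8 rows of A by x in lock-step, one column per 'cycle'.
    Each 'cycle' consumes one 128-bit DRAM word: eight 16-bit
    elements, one per lane (row)."""
    n, m = len(A), len(x)
    y = []
    for base in range(0, n, 8):              # process 8 rows at a time
        acc = [0] * 8                        # 48-bit accumulators (registers in HW)
        for j in range(m):                   # one DRAM beat per column
            xj = x[j]
            for lane in range(8):
                acc[lane] = (acc[lane] + A[base + lane][j] * xj) & MASK48
        y.extend(acc)
    return y

# Tiny random check against a direct computation (values kept
# non-negative and the row short, so the 48-bit wrap never triggers).
random.seed(1)
A = [[random.randrange(1 << 16) for _ in range(16)] for _ in range(8)]
x = [random.randrange(1 << 16) for _ in range(16)]
ref = [sum(a * b for a, b in zip(row, x)) for row in A]
print(matvec_8lane(A, x) == ref)   # True
```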

I think the block RAM can be used for storing the vectors, but (again)
I don't know for sure. Is there any problem with this? Do I need to
put this data in other external RAM chips?

Well, these are my thoughts. I need to verify that this architecture
makes sense, because if I hand this design to a hardware hacker to
make the board and then can't achieve more than about 60
calculations/s (60% of the theoretical maximum), I'd be in trouble...

Is it better to use DDR-400 instead of DDR2-400? It seems that board

layout and other considerations are easier with DDR.

Other questions: what FPGA and DRAM clock rates are needed? What
about FPGA utilization?

Well, the bottom line is: are my architecture and calculations correct?

Many thanks in advance.

Cheers,

Victor.

