# Matrix multiplication for the VC33

Started by April 16, 2005

Hi to all,

I'm very interested in matrix multiplication subject for the VC33.  In the past, I already had to implement a matrix multiplication algorithm in assembler for the VC33 (Extended Kalman Filter 30x30 matrices). I observed a large improvement in the performance (comparing to C)! Does anyone have any knowledge or documentation on this field? I implemented the matrix multiplication for matrices stored in column wise fashion (as in Fortran). Does anyone have any asm library for matrices calculations? What about the way of storing the matrices in memory! Any comments or information in this field is welcome! Even for C coding. I would like to change some ideas or experience!

I'm using Code Composer, but in terms of simulation and profiling, has a lot of bugs! Opinions? alternatives?

Best Regards

Bruno

Hi Bruno

Back when I was with TI I coded up this algorithm in ASM, but danged if I dont have it at my finger tips!  Luckily MMPY is pretty simple even though C (and CPP) make it looks pretty trashy.  Hopefully I get these words right, but in C/CPP
statically defined X and Y dimensions are fine, but dynamic dimensions are impossible since this would involve some kind of recomputation of base and offset addresses.  The bottom line is that you cannot simply pass a pointer to a matrix, have *variable* X and Y dimensions known, and simply calculate the row width or column height.  I needed this ability, so I came up with the following code. typedef struct MATRIX
{
long x;
long y;
double m[MAXNODES_SQRD];
}_matrix_;

/*-------

+---------------------+
| +-------+           |       dimrow(a)==dimcol(b)
2x4       4x3         2x3
+-----------+
|b b b|
|a a a a| |b b b| = |c c c|
|a a a a| |b b b|   |c c c|
|b b b|
--------*/
void MMPY(MATRIX *C, MATRIX *A, MATRIX *B) // C=A*B;
{
long ax,ay,bx;
double S, *a, *b, *c;
if( Ax!=By               ) MFAIL(0);
if((Ax *By)>MAXNODES_SQRD) MFAIL(1);
Cx=Bx; Cy=Ay;
a=A->m;
c=C->m;
for(ay=0;ay<Ay;ay++)
{
for(bx=0;bx<Bx;bx++)
{
a = (A->m);  a+=(ay*Ax);
b = (B->m);  b+=bx;
S=0;
for(ax=0;ax<Ax;ax++)
{ S += *a++ * *b;
b+=Bx;
}
*c++ = S;
}
}
}

The nice thing about the DSP is that it has the ability to manipulate addresses in this fashion with ease, not to mention the ability to read and write multiple data items per cycle.  And, if you look carefully, a few things become redundant.  In the end you should be able to create a DSP MMPY that approaches X*Y cycles plus some nominal overhead.  My recollection is that for an arbitrary X-Y dimensioned matrix, this would come out something like X*(Y+3)+6 cycles.

Hope this helps,
Keith Larson

Bruno wrote:

Hi to all,

I'm very interested in matrix multiplication subject for the VC33.  In the past, I already had to implement a matrix multiplication algorithm in assembler for the VC33 (Extended Kalman Filter 30x30 matrices). I observed a large improvement in the performance (comparing to C)! Does anyone have any knowledge or documentation on this field? I implemented the matrix multiplication for matrices stored in column wise fashion (as in Fortran). Does anyone have any asm library for matrices calculations? What about the way of storing the matrices in memory! Any comments or information in this field is welcome! Even for C coding. I would like to change some ideas or experience!

I'm using Code Composer, but in terms of simulation and profiling, has a lot of bugs! Opinions? alternatives?

Best Regards

Bruno

NEW!  You can now post a message or access and search the archives of this group on DSPRelated.com:
http://www.dsprelated.com/groups/c3x/1.php

_____________________________________

_____________________________________

Archives:  http://www.dsprelated.com/groups/c3x/1.php

To Post:  Send an email to c...@yahoogroups.com

Other DSP Related Groups: http://www.dsprelated.com/groups.php

Hello again Bruno

As I was getting up this morning it occurred to me that I had made a mistake (ackk!).

If the dimensions are X1:Y1 and X2:Y2 yielding dimension X1:Y2 there is another outer loop to consider.  If I get this right... Y1 and X2 will be the same, and each output element will take Y1 MAC operations to compute.   There are X1*Y2 outputs, so we get Y1*(X1*Y2) cycles, plus overhead.  The good news is that the inner loop operation is a MAC, which the DSP is pretty efficient at doing.

This brings up a comment about matrix add/subtract and how it maps into DSP execution cycles.  The inner loop is quite simple, but you should also notice that there are two memory reads and one memory write per loop.  If you then look at an ADDF/ADDI opcode you will see that it can access at most two memory operands at a given time.

Another thing to consider is that event though this DSP can do two memory reads, two memory writes or a read/write pair in one cycle (not many DSP's can do this by the way) if the write happens to be to the same memory block as one (or either?) of the two reads, I am pretty sure this will create a pipeline stall.  This occurs because the write is posted to its destination bus and actually occurs in the next cycle.  The bottom line here is that juggling the pointers around to use different memory blocks may help.

Interestingly however when the write is to a different block, or even external memory... the posting effect becomes an definite advantage :-)  Basically this is like having a one level deep write cache.  Therefor if the reads are from internal memory and the destination is (0 wait state) external memory there will be NO SLOWDOWN.

I simply typed in this ASM code for MMADD, but is should be pretty close to correct.  It uses global parameters rather than stack passed variables, so you will need to modify it for sure.  The prototype would be

Again,
Hope this helps, and sorry for the confused response

Best regards
Keith Larson

;--------------
;
; extern MATRIX A;
; extern MATRIX B;
; extern MATRIX C;
;--------------
.global _A,_B,_C
MADD      ldi   @_A,AR0          ; pointer to matrix A
ldi   @_B,AR1          ; pointer to matrix B
ldi   @_C,AR2          ; pointer to matrix C
ldi   *AR0++,R0        ; Ax
ldi   *AR0++,R1        ; Ay
|| sti   R0,*AR2++        ; save dimension Ax to Cx
mpyi  R0,R1,RC         ; Ax*Ay
sti   R1,*AR2++        ; save dimension Ay to Cy
addi  2,AR1            ; dont forget to increment B pointer
subi  1,RC             ; loop is RC+1
MMADLOOP  stf  R0,*AR2++         ; save result
rets                   ;

/*------------------
+---------------------+
| +-------+           |     dimrow(a)==dimcol(b)
2x3       2x3         2x3   dimcol(a)==dimrow(b)
+-----------+

|a a a| |b b b| = |c c c|
|a a a| |b b b|   |c c c|
--------------------*/
void MADD(MATRIX huge *C, MATRIX huge *A, MATRIX huge *B) // C=A+B;
{
long n;
double *a,*b,*c;
if( Ax!=Bx ) MFAIL(2);
if( Ay!=By ) MFAIL(3);
Cx=Ax; Cy=Ay;
a=A->m;
b=B->m;
c=C->m;
for(n=0;n<Ax*Ay;n++)
*c++ = *a++ + *b++;
}

Keith Larson wrote:
Hi Bruno

Back when I was with TI I coded up this algorithm in ASM, but danged if I dont have it at my finger tips!  Luckily MMPY is pretty simple even though C (and CPP) make it looks pretty trashy.  Hopefully I get these words right, but in C/CPP
statically defined X and Y dimensions are fine, but dynamic dimensions are impossible since this would involve some kind of recomputation of base and offset addresses.  The bottom line is that you cannot simply pass a pointer to a matrix, have *variable* X and Y dimensions known, and simply calculate the row width or column height.  I needed this ability, so I came up with the following code. typedef struct MATRIX
{
long x;
long y;
double m[MAXNODES_SQRD];
}_matrix_;

/*-------

+---------------------+
| +-------+           |       dimrow(a)==dimcol(b)
2x4       4x3         2x3
+-----------+
|b b b|
|a a a a| |b b b| = |c c c|
|a a a a| |b b b|   |c c c|
|b b b|
--------*/
void MMPY(MATRIX *C, MATRIX *A, MATRIX *B) // C=A*B;
{
long ax,ay,bx;
double S, *a, *b, *c;
if( Ax!=By               ) MFAIL(0);
if((Ax *By)>MAXNODES_SQRD) MFAIL(1);
Cx=Bx; Cy=Ay;
a=A->m;
c=C->m;
for(ay=0;ay<Ay;ay++)
{
for(bx=0;bx<Bx;bx++)
{
a = (A->m);  a+=(ay*Ax);
b = (B->m);  b+=bx;
S=0;
for(ax=0;ax<Ax;ax++)
{ S += *a++ * *b;
b+=Bx;
}
*c++ = S;
}
}
}

The nice thing about the DSP is that it has the ability to manipulate addresses in this fashion with ease, not to mention the ability to read and write multiple data items per cycle.  And, if you look carefully, a few things become redundant.  In the end you should be able to create a DSP MMPY that approaches X*Y cycles plus some nominal overhead.  My recollection is that for an arbitrary X-Y dimensioned matrix, this would come out something like X*(Y+3)+6 cycles.

Hope this helps,
Keith Larson

Bruno wrote:

Hi to all,

I'm very interested in matrix multiplication subject for the VC33.  In the past, I already had to implement a matrix multiplication algorithm in assembler for the VC33 (Extended Kalman Filter 30x30 matrices). I observed a large improvement in the performance (comparing to C)! Does anyone have any knowledge or documentation on this field? I implemented the matrix multiplication for matrices stored in column wise fashion (as in Fortran). Does anyone have any asm library for matrices calculations? What about the way of storing the matrices in memory! Any comments or information in this field is welcome! Even for C coding. I would like to change some ideas or experience!

I'm using Code Composer, but in terms of simulation and profiling, has a lot of bugs! Opinions? alternatives?

Best Regards

Bruno

NEW!  You can now post a message or access and search the archives of this group on DSPRelated.com:
http://www.dsprelated.com/groups/c3x/1.php

_____________________________________

_____________________________________

Archives:  http://www.dsprelated.com/groups/c3x/1.php

To Post:  Send an email to c...@yahoogroups.com

Other DSP Related Groups: http://www.dsprelated.com/groups.php

NEW!  You can now post a message or access and search the archives of this group on DSPRelated.com:
http://www.dsprelated.com/groups/c3x/1.php

_____________________________________