I'm very interested in matrix
multiplication subject for the VC33. In the past, I already had to
implement a matrix multiplication algorithm in assembler for the VC33 (Extended
Kalman Filter 30x30 matrices). I observed a large improvement in the performance
(comparing to C)! Does anyone have any knowledge or documentation on this field?
I implemented the matrix multiplication for matrices stored in column wise
fashion (as in Fortran). Does anyone have any asm library for matrices
calculations? What about the way of storing the matrices in memory! Any comments
or information in this field is welcome! Even for C coding. I would like to
change some ideas or experience!
I'm using Code Composer, but in terms
of simulation and profiling, has a lot of bugs! Opinions?
alternatives?
Best Regards
Bruno
Reply by Keith Larson●April 18, 200520050418
Hi Bruno
Back when I was with TI I coded up this algorithm in ASM, but danged if
I dont have it at my finger tips! Luckily MMPY is pretty simple even
though C (and CPP) make it looks pretty trashy. Hopefully I get these
words right, but in C/CPP statically
defined X and Y dimensions are fine, but dynamic dimensions are
impossible since this would involve some kind of recomputation of base
and offset addresses. The bottom line is that you
cannot simply pass a pointer to a matrix, have *variable* X and Y
dimensions known, and simply calculate the row width or column height.
I needed this ability, so I came up with the following code.
typedef struct MATRIX
{
long x;
long y;
double m[MAXNODES_SQRD];
}_matrix_;
/*
++

++
 dimrow(a)==dimcol(b)
2x4
4x3 2x3
++
b b b
a a a a b b b = c c c
a a a a b b b c c c
b b
b
*/
void MMPY(MATRIX *C, MATRIX *A, MATRIX *B) // C=A*B;
{
long ax,ay,bx;
double S, *a, *b, *c;
if(
Ax!=By
) MFAIL(0);
if((Ax *By)>MAXNODES_SQRD) MFAIL(1);
Cx=Bx; Cy=Ay;
a=A>m;
c=C>m;
for(ay=0;ay<Ay;ay++)
{
for(bx=0;bx<Bx;bx++)
{
a = (A>m); a+=(ay*Ax);
b = (B>m); b+=bx;
S=0;
for(ax=0;ax<Ax;ax++)
{ S += *a++ * *b;
b+=Bx;
}
*c++ = S;
}
}
}
The nice thing about the DSP is that it has the ability to manipulate
addresses in this fashion with ease, not to mention the ability to read
and write multiple data items per cycle. And, if you look carefully, a
few things become redundant. In the end you should be able to create a
DSP MMPY that approaches X*Y cycles plus some nominal overhead. My
recollection is that for an arbitrary XY dimensioned matrix, this
would come out something like X*(Y+3)+6 cycles.
Hope this helps,
Keith Larson
Bruno wrote:
Hi to all,
I'm very interested in
matrix multiplication subject for the VC33. In the past, I
already had
to implement a matrix multiplication algorithm in assembler for the
VC33 (Extended Kalman Filter 30x30 matrices). I observed a large
improvement in the performance (comparing to C)! Does anyone have any
knowledge or documentation on this field? I implemented the matrix
multiplication for matrices stored in column wise fashion (as in
Fortran). Does anyone have any asm library for matrices calculations?
What about the way of storing the matrices in memory! Any comments or
information in this field is welcome! Even for C coding. I would like
to change some ideas or experience!
I'm using Code
Composer, but in terms of simulation and profiling, has a lot of bugs!
Opinions? alternatives?
_____________________________________
Note: If you do a simple "reply" with your email client, only the
author of this message will receive your answer. You need to do a
"reply all" if you want your answer to be distributed to the entire
group.
_____________________________________
About this discussion group:
As I was getting up this morning it occurred to me that I had made a
mistake (ackk!).
If the dimensions are X1:Y1 and X2:Y2 yielding dimension X1:Y2 there is
another outer loop to consider. If I get this right... Y1 and X2 will
be the same, and each output element will take Y1 MAC operations to
compute. There are X1*Y2 outputs, so we get Y1*(X1*Y2) cycles, plus
overhead. The good news is that the inner loop operation is a MAC,
which the DSP is pretty efficient at doing.
This brings up a comment about matrix add/subtract and how it maps into
DSP execution cycles. The inner loop is quite simple, but you should
also notice that there are two memory reads and one memory write per
loop. If you then look at an ADDF/ADDI opcode you will see that it can
access at most two memory operands at a given time.
Another thing to consider is that event though this DSP can do two
memory reads, two memory writes or a read/write pair in one cycle (not
many DSP's can do this by the way) if the write happens to be to the
same memory block as one (or either?) of the two reads, I am pretty
sure this will create a pipeline stall. This occurs because the write
is posted to its destination bus and actually occurs in the next
cycle. The bottom line here is that juggling the pointers around to
use different memory blocks may help.
Interestingly however when the write is to a different block, or even
external memory... the posting effect becomes an definite advantage
:) Basically this is like having a one level deep write cache.
Therefor if the reads are from internal memory and the destination is
(0 wait state) external memory there will be NO SLOWDOWN.
I simply typed in this ASM code for MMADD, but is should be pretty
close to correct. It uses global parameters rather than stack passed
variables, so you will need to modify it for sure. The prototype would
be
Again,
Hope this helps, and sorry for the confused response
Best regards
Keith Larson
;
;
; void MMADD(void);
; extern MATRIX A;
; extern MATRIX B;
; extern MATRIX C;
;
.global _A,_B,_C
MADD ldi
@_A,AR0 ; pointer to
matrix A
ldi @_B,AR1 ;
pointer
to matrix B
ldi @_C,AR2 ;
pointer
to matrix C
ldi *AR0++,R0 ; Ax
ldi
*AR0++,R1 ; Ay
 sti
R0,*AR2++ ; save dimension Ax to Cx
mpyi
R0,R1,RC ; Ax*Ay
sti
R1,*AR2++ ; save dimension Ay to Cy
addi
2,AR1 ; dont
forget to increment B pointer
subi
1,RC ; loop is
RC+1
rptb
MMADDLOOP ;
addf
*AR0++,*AR1++,R0 ; read and add operands
MMADLOOP stf
R0,*AR2++ ; save result
rets
;
a a a b b b = c c c
a a a b b b c c c
*/
void MADD(MATRIX huge *C, MATRIX huge *A, MATRIX huge *B) // C=A+B;
{
long n;
double *a,*b,*c;
if( Ax!=Bx ) MFAIL(2);
if( Ay!=By ) MFAIL(3);
Cx=Ax; Cy=Ay;
a=A>m;
b=B>m;
c=C>m;
for(n=0;n<Ax*Ay;n++)
*c++ = *a++ + *b++;
}
Keith Larson wrote:
Hi Bruno
Back when I was with TI I coded up this algorithm in ASM, but danged if
I dont have it at my finger tips! Luckily MMPY is pretty simple even
though C (and CPP) make it looks pretty trashy. Hopefully I get these
words right, but in C/CPP statically
defined X and Y dimensions are fine, but dynamic dimensions are
impossible since this would involve some kind of recomputation of base
and offset addresses. The bottom line is that you
cannot simply pass a pointer to a matrix, have *variable* X and Y
dimensions known, and simply calculate the row width or column height.
I needed this ability, so I came up with the following code.
typedef struct MATRIX
{
long x;
long y;
double m[MAXNODES_SQRD];
}_matrix_;
/*
++

++
 dimrow(a)==dimcol(b)
2x4
4x3 2x3
++
b b b
a a a a b b b = c c c
a a a a b b b c c c
b b
b
*/
void MMPY(MATRIX *C, MATRIX *A, MATRIX *B) // C=A*B;
{
long ax,ay,bx;
double S, *a, *b, *c;
if(
Ax!=By
) MFAIL(0);
if((Ax *By)>MAXNODES_SQRD) MFAIL(1);
Cx=Bx; Cy=Ay;
a=A>m;
c=C>m;
for(ay=0;ay<Ay;ay++)
{
for(bx=0;bx<Bx;bx++)
{
a = (A>m); a+=(ay*Ax);
b = (B>m); b+=bx;
S=0;
for(ax=0;ax<Ax;ax++)
{ S += *a++ * *b;
b+=Bx;
}
*c++ = S;
}
}
}
The nice thing about the DSP is that it has the ability to manipulate
addresses in this fashion with ease, not to mention the ability to read
and write multiple data items per cycle. And, if you look carefully, a
few things become redundant. In the end you should be able to create a
DSP MMPY that approaches X*Y cycles plus some nominal overhead. My
recollection is that for an arbitrary XY dimensioned matrix, this
would come out something like X*(Y+3)+6 cycles.
Hope this helps,
Keith Larson
Bruno wrote:
Hi to all,
I'm very interested in
matrix multiplication subject for the VC33. In the past, I
already had
to implement a matrix multiplication algorithm in assembler for the
VC33 (Extended Kalman Filter 30x30 matrices). I observed a large
improvement in the performance (comparing to C)! Does anyone have any
knowledge or documentation on this field? I implemented the matrix
multiplication for matrices stored in column wise fashion (as in
Fortran). Does anyone have any asm library for matrices calculations?
What about the way of storing the matrices in memory! Any comments or
information in this field is welcome! Even for C coding. I would like
to change some ideas or experience!
I'm using Code
Composer, but in terms of simulation and profiling, has a lot of bugs!
Opinions? alternatives?
_____________________________________
Note: If you do a simple "reply" with your email client, only the
author of this message will receive your answer. You need to do a
"reply all" if you want your answer to be distributed to the entire
group.
_____________________________________
About this discussion group:
_____________________________________
Note: If you do a simple "reply" with your email client, only the
author of this message will receive your answer. You need to do a
"reply all" if you want your answer to be distributed to the entire
group.
_____________________________________
About this discussion group: