Optimizing C Code & LDDW/STDW Intrinsics

Started by Bernhard 'Gustl' Bauer October 22, 2004

Hi,

I am still trying to optimize my code, but I am stuck at the LDDW/STDW
intrinsics. I'm working with a TMS320C6713.

This is the original code:
---------------------------------
for (i=0;i<16;i++) {
    for (l=0;l<8;l++) {
        loop_in_array[i][l]=*coeff*r_in_array[0][l];
        loop_in_array[i][l]+=*(coeff+1)*r_in_array[1][l];
    }
    coeff+=8;
}
---------------------------------

This was my first improvement; it sped things up to 300%:
---------------------------------
for (i=0;i<16;i++) {
    for (l=0;l<8;l+=2) {
        loop_in_array[i][l]=*coeff*r_in_array[0][l];
        loop_in_array[i][l+1]=*coeff*r_in_array[0][l+1];
        loop_in_array[i][l]+=*(coeff+1)*r_in_array[1][l];
        loop_in_array[i][l+1]+=*(coeff+1)*r_in_array[1][l+1];
    }
    coeff+=8;
}
---------------------------------

Then I tried those intrinsics.
---------------------------------
for (i=0;i<16;i++) {
    for (l=0;l<8;l+=2) {
        float sum0, sum1;
        double r_in_array0, r_in_array1, coeff_d;

        r_in_array0 = * (double *)&r_in_array[0][l];
        r_in_array1 = * (double *)&r_in_array[1][l];
        coeff_d = * (double *)coeff;
        sum0 = _itof(_lo(coeff_d)) * _itof(_lo(r_in_array0));
        sum1 = _itof(_lo(coeff_d)) * _itof(_hi(r_in_array0));
        sum0 += _itof(_hi(coeff_d)) * _itof(_lo(r_in_array1));
        sum1 += _itof(_hi(coeff_d)) * _itof(_hi(r_in_array1));
        *((double *) &loop_in_array[i][l])=
            _itod((unsigned)sum0, (unsigned)sum1);
    }
    coeff+=8;
}
---------------------------------
It got faster again, but it no longer does what it is supposed to do.
I can't see where I mixed things up. Can you?

TIA Gustl
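
One detail in the intrinsic version that may be worth checking, although nobody in the thread raises it: _itof() reinterprets a 32-bit pattern as a float, whereas the cast (unsigned)sum0 converts the numeric value and drops the fraction. The sketch below shows the difference in portable C. The function names are hypothetical stand-ins written with memcpy, not TI intrinsics, and a 32-bit IEEE-754 float is assumed. On the C6000 compiler the bit-preserving counterpart of _itof() should be _ftoi(); the argument order of _itod() is also worth confirming in the compiler manual.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical portable stand-in for a bit-reinterpreting intrinsic:
   returns the 32-bit pattern of f unchanged. */
uint32_t float_to_bits(float f)
{
    uint32_t u;
    assert(sizeof u == sizeof f);   /* sketch assumes a 32-bit float */
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical stand-in for the opposite direction: treat a 32-bit
   pattern as a float without changing any bits. */
float bits_to_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* What a plain C cast does: converts the VALUE, discarding the fraction. */
uint32_t float_to_uint_value(float f)
{
    return (uint32_t)f;
}
```

For 2.75f the cast yields the integer 2, while the bit pattern is 0x40300000. If the cast result is stored and later read back as a float, it shows up as a tiny denormal rather than 2.75.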





Can you send me the C code so that I can compile it and look at the
assembly it is producing? I do not think you can load a double word from
&r_in_array[0][1], as I suspect that it is not double-word aligned. LDDW
assumes that the address is double-word aligned. If this is not true, it
will truncate the 3 LSBs to a double-word-aligned address and load from
there. This may be why the code is not working. You have to change your
code by unrolling to make sure that you are always loading from a
double-word address. I feel your code should look more like:

> for (i=0;i<16;i++) {
> for (l=0;l<8;l+=2) {
> float sum0, sum1;
> double r_in_array0, r_in_array1, coeff_d;
>
> r_in_array0 = * (double *)&r_in_array[0][0];
> r_in_array1 = * (double *)&r_in_array[1][0];
> coeff_d = * (double *)coeff;
> sum0 = _itof(_lo(coeff_d)) * _itof(_lo(r_in_array0));
> sum1 = _itof(_lo(coeff_d)) * _itof(_hi(r_in_array0));
> sum0 += _itof(_hi(coeff_d)) * _itof(_lo(r_in_array1));
> sum1 += _itof(_hi(coeff_d)) * _itof(_hi(r_in_array1));
> *((double *) &loop_in_array[i][l])=
> _itod((unsigned)sum0, (unsigned)sum1);
>
> }
> coeff+=8;
> }
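
To make the alignment point concrete, here is a small sketch in portable C of what "double-word aligned" means and of the address that would be accessed if the three least significant bits are dropped, as described above. The function names are hypothetical, and the truncation model is taken from this post rather than checked against the TI documentation.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if addr lies on an 8-byte (double-word) boundary. */
int is_dword_aligned(uintptr_t addr)
{
    return (addr & (uintptr_t)7) == 0;
}

/* Address that would actually be accessed if the low 3 bits are
   dropped (assumption: behaviour as described in the post above). */
uintptr_t truncate_to_dword(uintptr_t addr)
{
    return addr & ~(uintptr_t)7;
}
```

Under this model, a load aimed at a float at address 0x8004 would silently fetch the pair starting at 0x8000, so the two values handed to the multiply would not be the ones intended.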

Regds
Jagadeesh Sankaran





Thomas Laiminger wrote:

Hi Thomas,

> have you tried to use #pragmas to give the compiler some infos about how he
> can unroll the loop?

I don't know what pragmas you mean. But I have unrolled the inner loop
and it's still the same.
>
> In some cases this can bring some speed... Where is your stack located? Your
> variables are allocated on the stack, if you have it in external memory this
> might slow things down
>
My stack is, of course, in internal memory.

Gustl
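
For readers wondering which pragmas are meant: the TI C6000 compiler accepts MUST_ITERATE and UNROLL hints placed directly before a loop. The sketch below applies them to the original, non-intrinsic loop. The function name, the ROWS/COLS macros and the float element type are assumptions for illustration. The pragmas are guarded by __TI_COMPILER_VERSION__ so that the file also builds with a host compiler. Whether the hints help in this particular case would have to be measured.

```c
#include <assert.h>

#define ROWS 16
#define COLS 8

/* Hypothetical wrapper around the loop from the first post. */
void mix_rows(float out[ROWS][COLS], float in[2][COLS], const float *coeff)
{
    int i, l;

    for (i = 0; i < ROWS; i++) {
#ifdef __TI_COMPILER_VERSION__
/* Trip count is exactly 8 (min 8, max 8, multiple of 8); ask for 2x unroll. */
#pragma MUST_ITERATE(8, 8, 8)
#pragma UNROLL(2)
#endif
        for (l = 0; l < COLS; l++) {
            out[i][l]  = coeff[0] * in[0][l];
            out[i][l] += coeff[1] * in[1][l];
        }
        coeff += 8;   /* same stride as in the original code */
    }
}
```

The idea is to let the compiler do the unrolling that was done by hand in the second version, while the source stays in its simple form.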




Have you tried to use #pragmas to give the compiler some information
about how it can unroll the loop?

In some cases this can bring some speed. Where is your stack located?
Your variables are allocated on the stack; if you have it in external
memory, this might slow things down.




Jagadeesh Sankaran wrote:

> Can you send me the C code so that I can compile and look at the
> assembly

I think this is the C code :-)
I'm not allowed to give away the complete code.

> it is producing. I do not think you can load a double word from
> &r_in_array[0][1] as I suspect that it is not double word aligned. The
> LDDW assumes that the address is double word aligned. If this is not
> true, it will truncate the 3 lsb's to a double word-aligned address and
> load from there. This may be why the code is

I have a pragma that aligns r_in_array and loop_in_array to 8 bytes. I
will check coeff when I'm in the office again. Because l can only be 0,
2, 4 or 6, loop_in_array[i][l] should lie on an 8-byte boundary as well.
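
The argument above can be checked with a little address arithmetic. This is a sketch assuming 4-byte floats, rows of 8 elements and a base address that is itself 8-byte aligned; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COLS 8

/* Byte offset of element [i][l] in a float array with COLS columns. */
size_t element_offset(int i, int l)
{
    return ((size_t)i * COLS + (size_t)l) * sizeof(float);
}

/* Nonzero if element [i][l] of an array starting at base lies on an
   8-byte boundary. */
int element_is_dword_aligned(uintptr_t base, int i, int l)
{
    return ((base + element_offset(i, l)) & (uintptr_t)7) == 0;
}
```

Each row is 32 bytes long, so every row starts on an 8-byte boundary, and an even l adds a multiple of 8 bytes. The same reasoning applies to coeff only if its start address is 8-byte aligned, because coeff += 8 advances it by 32 bytes each time.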

> not working. You have to change your code by unrolling to make sure that
> you are always loading from a double word address. I feel your code
> should look more like:

I have unrolled the code and it's still the same.
>
>
>>for (i=0;i<16;i++) {
>> for (l=0;l<8;l+=2) {
>> float sum0, sum1;
>> double r_in_array0, r_in_array1, coeff_d;
>>
>> r_in_array0 = * (double *)&r_in_array[0][0];
>> r_in_array1 = * (double *)&r_in_array[1][0];
I need l = 0,2,4,6 in place of the constant index [0] in the two lines above!

>> coeff_d = * (double *)coeff;
>> sum0 = _itof(_lo(coeff_d)) * _itof(_lo(r_in_array0));
>> sum1 = _itof(_lo(coeff_d)) * _itof(_hi(r_in_array0));
>> sum0 += _itof(_hi(coeff_d)) * _itof(_lo(r_in_array1));
>> sum1 += _itof(_hi(coeff_d)) * _itof(_hi(r_in_array1));
>> *((double *) &loop_in_array[i][l])=
>> _itod((unsigned)sum0, (unsigned)sum1);
>>
>> }
>> coeff+=8;
>>}

Gustl