Hi,

I am still trying to optimize my code, but I am stuck on the LDDW/STDW intrinsics. I'm working with a TMS320C6713.

This is the original code:
---------------------------------
for (i=0;i<16;i++) {
    for (l=0;l<8;l++) {
        loop_in_array[i][l]=*coeff*r_in_array[0][l];
        loop_in_array[i][l]+=*(coeff+1)*r_in_array[1][l];
    }
    coeff+=8;
}
---------------------------------

This was my first improvement; it sped things up to 300%:
---------------------------------
for (i=0;i<16;i++) {
    for (l=0;l<8;l+=2) {
        loop_in_array[i][l]=*coeff*r_in_array[0][l];
        loop_in_array[i][l+1]=*coeff*r_in_array[0][l+1];
        loop_in_array[i][l]+=*(coeff+1)*r_in_array[1][l];
        loop_in_array[i][l+1]+=*(coeff+1)*r_in_array[1][l+1];
    }
    coeff+=8;
}
---------------------------------

Then I tried those intrinsics:
---------------------------------
for (i=0;i<16;i++) {
    for (l=0;l<8;l+=2) {
        float sum0, sum1;
        double r_in_array0, r_in_array1, coeff_d;

        r_in_array0 = * (double *)&r_in_array[0][l];
        r_in_array1 = * (double *)&r_in_array[1][l];
        coeff_d = * (double *)coeff;
        sum0 = _itof(_lo(coeff_d)) * _itof(_lo(r_in_array0));
        sum1 = _itof(_lo(coeff_d)) * _itof(_hi(r_in_array0));
        sum0 += _itof(_hi(coeff_d)) * _itof(_lo(r_in_array1));
        sum1 += _itof(_hi(coeff_d)) * _itof(_hi(r_in_array1));
        *((double *) &loop_in_array[i][l])=
            _itod((unsigned)sum0, (unsigned)sum1);
    }
    coeff+=8;
}
---------------------------------

It got faster still, but it no longer does what it is supposed to do. I can't see where I mixed things up. Can you?

TIA
Gustl
Optimizing C Code & LDDW/STDW Intrinsics
Started by ●October 22, 2004
Reply by ●October 22, 2004
Bernhard 'Gustl' Bauer wrote:
> It still got faster, but it doesn't do anymore what it is supposed to
> do. I cant see where I mixed things up. Can you?
>
> TIA Gustl

Can you send me the C code so that I can compile it and look at the assembly it is producing?

I do not think you can load a double word from &r_in_array[0][1], as I suspect that it is not double-word aligned. LDDW assumes that the address is double-word aligned. If this is not true, it will truncate the 3 LSBs to get a double-word-aligned address and load from there. This may be why the code is not working.

You have to change your code by unrolling to make sure that you are always loading from a double-word address. I feel your code should look more like:

> for (i=0;i<16;i++) {
>     for (l=0;l<8;l+=2) {
>         float sum0, sum1;
>         double r_in_array0, r_in_array1, coeff_d;
>
>         r_in_array0 = * (double *)&r_in_array[0][0];
>         r_in_array1 = * (double *)&r_in_array[1][0];
>         coeff_d = * (double *)coeff;
>         sum0 = _itof(_lo(coeff_d)) * _itof(_lo(r_in_array0));
>         sum1 = _itof(_lo(coeff_d)) * _itof(_hi(r_in_array0));
>         sum0 += _itof(_hi(coeff_d)) * _itof(_lo(r_in_array1));
>         sum1 += _itof(_hi(coeff_d)) * _itof(_hi(r_in_array1));
>         *((double *) &loop_in_array[i][l])=
>             _itod((unsigned)sum0, (unsigned)sum1);
>     }
>     coeff+=8;
> }

Regds
Jagadeesh Sankaran
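The alignment rule described above can be checked with plain address arithmetic. The following is a minimal sketch in portable C, not TI code: the helper names `dword_truncate`, `is_dword_aligned`, and `elem_addr` are hypothetical, and it assumes 4-byte floats and rows of 8 floats as in the posted arrays.

```c
#include <stdint.h>

/* Hypothetical helper, not a TI API: models the address an aligned
   double-word load would use once the 3 low address bits are dropped. */
uintptr_t dword_truncate(uintptr_t addr)
{
    return addr & ~(uintptr_t)7;
}

/* Returns 1 when addr already sits on an 8-byte boundary, else 0. */
int is_dword_aligned(uintptr_t addr)
{
    return (addr & 7u) == 0;
}

/* Byte address of element [row][col] in a float[rows][8] array that
   starts at base (assumes sizeof(float) == 4). */
uintptr_t elem_addr(uintptr_t base, unsigned row, unsigned col)
{
    return base + (uintptr_t)(row * 8u + col) * 4u;
}
```

Under these assumptions, an array whose base is 8-byte aligned has every even column index on a double-word boundary, while an odd index such as [0][1] is 4 bytes off and would be truncated back to the preceding boundary.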
Reply by ●October 24, 2004
Thomas Laiminger wrote:

Hi Thomas,

> have you tried to use #pragmas to give the compiler some infos about how he
> can unroll the loop?

I don't know which pragmas you mean. But I have unrolled the inner loop and it's still the same.

> In some cases this can bring some speed... Where is your stack located? Your
> variables are allocated on the stack, if you have it in external memory this
> might slow things down

My stack is of course in internal memory.

Gustl
Reply by ●October 24, 2004
Have you tried using #pragmas to give the compiler some information about how it can unroll the loop? In some cases this can bring some speed.

Where is your stack located? Your variables are allocated on the stack; if it is in external memory, this might slow things down.
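As an illustration of the pragma suggestion, here is a minimal sketch of one row of the inner loop with trip-count and unroll hints. It assumes the TI C6000 compiler's `MUST_ITERATE` and `UNROLL` pragmas and its predefined `__TI_COMPILER_VERSION__` macro; the function name `mac_row` and its parameters are hypothetical, and the hints have not been verified on a C6713 build.

```c
/* Hypothetical helper: out[l] = c0*r0[l] + c1*r1[l] for l = 0..7,
   i.e. one row of the loop from the original post. */
void mac_row(float *out, const float *r0, const float *r1,
             float c0, float c1)
{
    int l;
#ifdef __TI_COMPILER_VERSION__
    /* Assumed TI-specific hints: the loop runs exactly 8 times,
       and the compiler is asked to unroll it by a factor of 2. */
    #pragma MUST_ITERATE(8, 8, 8)
    #pragma UNROLL(2)
#endif
    for (l = 0; l < 8; l++) {
        out[l] = c0 * r0[l] + c1 * r1[l];
    }
}
```

On other compilers the guard removes the pragmas, so the function still builds and computes the same values. Whether the hints help on the target is something only the generated assembly can show.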
Reply by ●October 24, 2004
Jagadeesh Sankaran wrote:
> Bernhard 'Gustl' Bauer wrote:
>
>>Then I tried those intrinsics.
>>---------------------------------
>>for (i=0;i<16;i++) {
>>  for (l=0;l<8;l+=2) {
>>    float sum0, sum1;
>>    double r_in_array0, r_in_array1, coeff_d;
>>
>>    r_in_array0 = * (double *)&r_in_array[0][l];
>>    r_in_array1 = * (double *)&r_in_array[1][l];
>>    coeff_d = * (double *)coeff;
>>    sum0 = _itof(_lo(coeff_d)) * _itof(_lo(r_in_array0));
>>    sum1 = _itof(_lo(coeff_d)) * _itof(_hi(r_in_array0));
>>    sum0 += _itof(_hi(coeff_d)) * _itof(_lo(r_in_array1));
>>    sum1 += _itof(_hi(coeff_d)) * _itof(_hi(r_in_array1));
>>    *((double *) &loop_in_array[i][l])=
>>      _itod((unsigned)sum0, (unsigned)sum1);
>>
>>  }
>>  coeff+=8;
>>}
>>---------------------------------
>>It still got faster, but it doesn't do anymore what it is supposed to
>>do. I cant see where I mixed things up. Can you?
>>
>>TIA Gustl
>
> Can you send me the C code so that I can compile and look at the
> assembly

I think this is the C code :-) I'm not allowed to give away the complete code.

> it is producing. I do not think you can load a double word from
> &r_in_array[0][1]
> as I supsect that it is not double word aligned. The LDDW assumes that
> the address
> is double word aligned. If this is not true, it will truncate the 3
> lsb's to
> a double word-aligned address and load from there. This may be why the
> code is

I have a pragma that aligns r_in_array and loop_in_array to 8. I will check coeff when I'm in the office again. Because l can only be 0, 2, 4 and 6, the 8-byte boundary should hold for loop_in_array[i][l] as well.

> not working. You have to change your code by unrolling to make sure that
> you
> are always loading from a double word address.I feel your code should
> look
> more like:

I have unrolled the code and it's still the same.

>>for (i=0;i<16;i++) {
>>  for (l=0;l<8;l+=2) {
>>    float sum0, sum1;
>>    double r_in_array0, r_in_array1, coeff_d;
>>
>>    r_in_array0 = * (double *)&r_in_array[0][0];
>>    r_in_array1 = * (double *)&r_in_array[1][0];
                                               ^
I need l = 0,2,4,6 at this place! -------------|
>>    coeff_d = * (double *)coeff;
>>    sum0 = _itof(_lo(coeff_d)) * _itof(_lo(r_in_array0));
>>    sum1 = _itof(_lo(coeff_d)) * _itof(_hi(r_in_array0));
>>    sum0 += _itof(_hi(coeff_d)) * _itof(_lo(r_in_array1));
>>    sum1 += _itof(_hi(coeff_d)) * _itof(_hi(r_in_array1));
>>    *((double *) &loop_in_array[i][l])=
>>      _itod((unsigned)sum0, (unsigned)sum1);
>>
>>  }
>>  coeff+=8;
>>}

Gustl
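One detail in the intrinsics version that may be worth checking, independent of alignment: the loads reinterpret bit patterns through `_itof`, but the store packs the results with a plain `(unsigned)` cast, which in C converts the numeric value instead of preserving the float's bits. The sketch below shows the difference in portable C. The function names are hypothetical stand-ins, not TI intrinsics, and it assumes 32-bit IEEE-754 floats; on the TI compiler the counterpart of `_itof` would presumably be `_ftoi`, and the argument order of `_itod` should be confirmed in the compiler manual.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a bit-reinterpreting float-to-int:
   returns the raw 32-bit pattern of f without changing any bits. */
uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Hypothetical stand-in for the reverse direction: treats u as the
   raw bit pattern of a float. */
float bits_float(uint32_t u)
{
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

/* What a plain C cast does: converts the value, so 1.5f becomes the
   integer 1 and the float's bit pattern is lost. */
uint32_t float_value_cast(float f)
{
    return (uint32_t)f;
}
```

Round-tripping through `float_bits` and `bits_float` returns the original float, while round-tripping through `float_value_cast` does not. If the posted store uses the value cast, the packed double word would hold truncated integers where the raw float bits were expected.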