|
Hello,

I have written C source code with CCS 2.2 which needs about 20% too much time :-( I tried improving it with optimized data structures, compiler options and all the hints I found in the lst files. If any of you know of some good documentation .... :-)

But I think I have to switch to assembler. I am thinking of taking the CCS-generated asm file and trying to improve it. Has anybody done this? How much speed have you gained? Any pointers to documents?

TIA Gustl |
|
|
optimizing C code / switching to assembler
Started by ●October 12, 2004
Reply by ●October 12, 2004
|
Hi Gustl,

Instead of using the generated ASM file, you should try writing Linear ASM for the critical, cycle-intensive modules (kernels). Please refer to the TI manual on writing Linear ASM. Sometimes the compiler will generate a better schedule for Linear ASM.

Regards,
Mihir Mody, Multimedia codecs group, Texas Instruments India, Ltd |
|
|
Reply by ●October 12, 2004
|
Can you share the code, or some snippets of it, so that I can offer more productive advice?

Regards,
Jagadeesh Sankaran |
|
|
Reply by ●October 12, 2004
|
I would try writing C with intrinsics first. TI intrinsics are very easy to incorporate into C code, and I have seen very good improvements. Check out "Optimizing C compilers Guide" - SPRU187L.pdf from dspvillage.ti.com.

If you have already done that, Mihir's suggestion would be next. You might also want to check out a post from Jagadeesh Sankaran on this board from quite a while ago:

http://groups.yahoo.com/group/c6x/message/1283

~ka |
|
|
Reply by ●October 12, 2004
|
Gustl,

Is the compiler generating a pipelined loop? Turn on ASM output and have a look. I assume it is some sort of loop that is taking too long?

I would second Jagadeesh's comment: can we see the code?

The TI C compiler is actually very good at optimization if you can get everything organized the right way.

- Andrew

Andrew Elder
AudioScience, Inc. (Rochester Branch) <www.audioscience.com> |
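Andrew's point about getting the code "organized the right way" can be made concrete with a small kernel. The sketch below uses a hypothetical helper, `dot15`, which is not from the thread. It shows two hints that are commonly used to help the TI compiler software-pipeline a loop: `restrict`-qualified pointers and the `MUST_ITERATE` pragma. Whether both are accepted by the compiler shipped with CCS 2.2 is an assumption to check against the compiler guide. The pragma is guarded by the `_TMS320C6X` macro so that the code also builds with a host compiler.

```c
/* Hypothetical helper, not from the thread: one 15-tap dot product.
 * 'restrict' promises the compiler that x and h never overlap, which
 * lets it schedule loads from both arrays freely. */
static float dot15(const float *restrict x, const float *restrict h)
{
    float sum = 0.0f;
    int j;

#ifdef _TMS320C6X
/* TI-specific hint (assumed available): the loop runs exactly 15 times. */
#pragma MUST_ITERATE(15, 15)
#endif
    for (j = 0; j < 15; j++)
        sum += x[j] * h[j];

    return sum;
}
```

Called with x = 1, 2, ..., 15 and all-ones coefficients, `dot15` returns 120. Comparing the generated ASM with and without the hints is one way to see whether the loop was pipelined.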
Reply by ●October 13, 2004
|
Jagadeesh Sankaran wrote:
> Can you share the code, or some snippets of it so that I can offer
> more productive advice.

Unfortunately my company wouldn't allow this for the complete code. But I can show you a snippet:

    unsigned int i,j,l;
    unsigned int idx[6];
    float out_array[6][8];
    float in_array[6][8];
    float fil[6][16];

    for (l=0;l<8;l++) {
      coeff=coeff_h;
      for (i=0;i<6;i++) {
        out_array[i][l]=0;
        for (j=0;j<15;j++) {
          out_array[i][l]+=fil[i][(j+idx[i])&0xF]* *coeff++;
        }
        fil[i][idx[i]++]=in_array[i][l];
        idx[i]&=0xF;
      }
    }

I have a lot of constructions like this: a loop in a loop in a loop with 2-dim arrays.

Gustl |
|
|
Reply by ●October 13, 2004
|
Anand K wrote:
> I would try writing C with intrinsics first. TI intrinsics are very
> easy to incorporate into C-code, and I have seen very good
> improvements. Check out "Optimizing C compilers Guide" - SPRU187L.pdf
> from dspvillage.ti.com

I will have a closer look at the intrinsics. I already have a few ideas where they could be useful.

> If you have already done that, Mihir's suggestion would be next. You
> might also want to check out a post from Jagadeesh Sankaran on this
> board quite a while ago:
>
> http://groups.yahoo.com/group/c6x/message/1283

This looks very promising. I'll go over it.

Thanks a lot
Gustl |
Reply by ●October 13, 2004
|
Anand K wrote:
> If you have already done that, Mihir's suggestion would be next. You
> might also want to check out a post from Jagadeesh Sankaran on this
> board quite a while ago:
>
> http://groups.yahoo.com/group/c6x/message/1283

I have just been over this and played with the compiler options.

Jagadeesh Sankaran wrote:
> Remember, using -g automatically slows down code, because not all
> the advanced optimizations can be done, and this slows one down by
> 10-15%.

So I immediately switched off debug mode! The result was that my IRQ routine needed about 6% _more_ time than before! Can anybody explain this?

I have CCS 2.2, with these options set:

(-g) -k -q -s -al -os -o3 -mt -mw -ml3 -mv6710

Gustl |
Reply by ●October 13, 2004
|
Bernhard,

You could try something like the following. You may be able to convert to LDDW operations (via intrinsics) if the array address accesses allow it (every LDDW has to be on an 8-byte boundary). I have sometimes reorganized some coeff arrays specifically for that purpose.

    for (l=0;l<8;l++) {
      coeff=coeff_h;
      for (i=0;i<6;i++) {
        float sum0=0;
        float sum1=0;
        int idxx=idx[i];
        /* taps 0..13, two per iteration */
        for (j=0;j<14;j+=2) {
          sum0+=fil[i][(j+idxx)&0xF] * *coeff++;
          sum1+=fil[i][(j+1+idxx)&0xF] * *coeff++;
        }
        /* remaining 15th tap (j=14) */
        sum0+=fil[i][(14+idxx)&0xF] * *coeff++;
        out_array[i][l]=sum0+sum1;
        fil[i][idx[i]++]=in_array[i][l];
        idx[i]&=0xF;
      }
    }

BTW, I haven't tested the above. It may contain bugs....

- Andrew

Andrew Elder
AudioScience, Inc. (Rochester Branch) <www.audioscience.com> |
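One way to sanity-check an unrolled loop like the one above before timing it is to run it against the straightforward loop on known data. The sketch below reduces the thread's snippet to a single channel and a single output sample. The names `fir_ref`, `fir_unrolled` and `count_mismatches` are hypothetical, and the test data is deliberately made of small integers so that both summation orders give exactly the same float result. With real data, splitting the sum across two accumulators can change the last bits.

```c
#define TAPS 15   /* taps per channel, as in the posted snippet */
#define LEN  16   /* circular delay-line length (power of two) */

/* Reference: the inner loop as originally posted, one accumulator. */
static float fir_ref(const float *fil, unsigned idx, const float *coeff)
{
    float sum = 0.0f;
    unsigned j;
    for (j = 0; j < TAPS; j++)
        sum += fil[(j + idx) & (LEN - 1)] * coeff[j];
    return sum;
}

/* Unrolled by two with separate accumulators, plus one tail tap. */
static float fir_unrolled(const float *fil, unsigned idx, const float *coeff)
{
    float sum0 = 0.0f, sum1 = 0.0f;
    unsigned j;
    for (j = 0; j < TAPS - 1; j += 2) {          /* taps 0..13 in pairs */
        sum0 += fil[(j + idx) & (LEN - 1)]     * coeff[j];
        sum1 += fil[(j + 1 + idx) & (LEN - 1)] * coeff[j + 1];
    }
    sum0 += fil[(TAPS - 1 + idx) & (LEN - 1)] * coeff[TAPS - 1];  /* tap 14 */
    return sum0 + sum1;
}

/* Compare both versions for every start index of the circular buffer.
 * Returns the number of start indices where the results differ. */
static int count_mismatches(void)
{
    float fil[LEN], coeff[TAPS];
    unsigned k, idx;
    int bad = 0;

    for (k = 0; k < LEN; k++)
        fil[k] = (float)(k + 1);          /* 1, 2, ..., 16 */
    for (k = 0; k < TAPS; k++)
        coeff[k] = (float)(TAPS - k);     /* 15, 14, ..., 1 */

    for (idx = 0; idx < LEN; idx++)
        if (fir_ref(fil, idx, coeff) != fir_unrolled(fil, idx, coeff))
            bad++;
    return bad;
}
```

If `count_mismatches()` returns 0, the unrolled loop covers the same 15 taps and the same coefficients as the original for every buffer position.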
|
|
Reply by ●October 14, 2004
|
Hi there,

First of all, you need to figure out what the bottleneck of your code is. Check the generated assembly code to find out. Your code seems to have several limiting factors, so I suppose a 20% speed-up should be no problem. The limits include load/store, multiplication and branch delay.

What I can suggest:

1. Do what Andrew suggested.
   Use LDDW and STDW to reduce the load/store pressure, and check the assembly to make sure the TI compiler keeps the loop result in a register instead of issuing a store instruction in each iteration.

2. Transform your loop.
   The TI compiler is great at software pipelining, but it does rather poorly at exploiting instruction parallelism and reducing branch delay compared to the TriMedia compiler. CCS 3.0 does a little better. Check the assembly code to see whether software pipelining is applied to your code. If it is already enabled, you don't have to try the following.
   You may make the j loop the outer loop, move the i loop inside it and unroll it completely, like:

       for( l ) {
         for( j ) {
           //i=0;
           sum0 += ..
           //i=1
           sum1 += ..
           //i=2
           sum2 += ..
           ...
         }
         // store sum0, sum1,...
       }

   Then you will have only a 2-level loop.

3. Change the 2-D array to a 1-D array.
   Then increment the pointer after each loop. It should save some instructions for constructing the address. Check the assembly for the answer.

4. Use in_array directly.
   At a glance, it seems fil could be eliminated.

5. Try intrinsics.
   For example, EXTU may save instructions to load 0xF.

6. Algorithm optimization.
   Check whether some fast algorithm can be used, such as FFT.

Good luck.
Quentin |
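Quentin's second suggestion (tap loop outside, channel loop fully unrolled inside) can be sketched as follows for one output sample. This is only an illustration under stated assumptions: the function names and the `TAP` macro are hypothetical, and the coefficients are assumed to be stored channel after channel (6 x 15 values), which matches how the posted loop walks `*coeff++`. Whether this form is actually faster on the C6710 has to be checked in the generated assembly.

```c
#define CH   6    /* channels */
#define TAPS 15   /* taps per channel */
#define LEN  16   /* circular delay-line length (power of two) */

/* Reference: channel loop outside, tap loop inside (as posted). */
static void fir6_ref(float fil[CH][LEN], const unsigned idx[CH],
                     const float *coeff, float out[CH])
{
    unsigned i, j;
    for (i = 0; i < CH; i++) {
        float sum = 0.0f;
        for (j = 0; j < TAPS; j++)
            sum += fil[i][(j + idx[i]) & (LEN - 1)] * *coeff++;
        out[i] = sum;
    }
}

/* One tap of channel c; coefficient j of channel c is coeff[c*TAPS + j]. */
#define TAP(c) (fil[c][(j + idx[c]) & (LEN - 1)] * coeff[(c) * TAPS + j])

/* Interchanged: a single loop over the taps, six independent accumulators
 * that the compiler can keep in registers and schedule in parallel. */
static void fir6_interchanged(float fil[CH][LEN], const unsigned idx[CH],
                              const float *coeff, float out[CH])
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f;
    float s3 = 0.0f, s4 = 0.0f, s5 = 0.0f;
    unsigned j;

    for (j = 0; j < TAPS; j++) {
        s0 += TAP(0);
        s1 += TAP(1);
        s2 += TAP(2);
        s3 += TAP(3);
        s4 += TAP(4);
        s5 += TAP(5);
    }

    /* Store each result once, after the loop. */
    out[0] = s0; out[1] = s1; out[2] = s2;
    out[3] = s3; out[4] = s4; out[5] = s5;
}
```

Each channel still accumulates its taps in the original order, so both functions should produce the same sums. The delay-line update (`fil[i][idx[i]++] = ...`) is left out of the sketch and would follow the stores unchanged.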
|
|






