DSPRelated.com
Forums

optimizing C code / switching to assembler

Started by Bernhard 'Gustl' Bauer October 12, 2004

Hello,

I have written C code with CCS 2.2 that takes about 20% too much
time :-( I have tried improving it by optimizing the data structures,
changing compiler options and following all the hints I found in the
lst files. If any of you knows some good documentation .... :-)

But I think I have to switch to assembler. I am thinking of taking the
CCS-generated asm file and trying to improve it. Has anybody done this?
How much speed did you gain? Any pointers to documents?

TIA Gustl




Hi Gustl,

Instead of editing the generated ASM file, you should try writing linear ASM
for the critical, cycle-intensive modules (kernels). Please refer to the TI
manual on writing linear ASM. Sometimes the compiler will generate a better
schedule from linear ASM.

Regards,
Mihir Mody,
Multimedia codecs group,
Texas Instruments India, Ltd,
Email :
Phone : +91-80-25099307

_____________________________________
Note: If you do a simple "reply" with your email client, only the author of this
message will receive your answer. You need to do a "reply all" if you want your
answer to be distributed to the entire group.

_____________________________________
About this discussion group:

To Join: Send an email to

To Post: Send an email to

To Leave: Send an email to

Archives: http://www.yahoogroups.com/group/c6x

Other Groups: http://www.dsprelated.com

Yahoo! Groups Links




Can you share the code, or some snippets of it, so that I can offer more
productive advice?

Regards
Jagadeesh Sankaran




I would try writing C with intrinsics first. TI intrinsics are very
easy to incorporate into C code, and I have seen very good
improvements. Check out the "TMS320C6000 Optimizing Compiler User's
Guide" (SPRU187L.pdf) from dspvillage.ti.com

If you have already done that, Mihir's suggestion would be next. You
might also want to check out a post from Jagadeesh Sankaran on this
board quite a while ago:

http://groups.yahoo.com/group/c6x/message/1283

~ka






Gustl,

Is the compiler generating a pipelined loop?

Turn on ASM output and have a look.

I assume it is some sort of loop that is taking too long?

I would second Jagadeesh's comment - can we see the code?

The TI C compiler is actually very good at optimization if you can get
everything organized the right way.
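
As a rough illustration of what "organized the right way" can mean, here is a minimal sketch, not taken from the original code. The function name dot_taps and the tap count of 15 are assumptions made for the example. It uses restrict-qualified pointers so the compiler knows the inputs do not alias, a local accumulator so the running sum need not be stored on every iteration, and, when built with the TI compiler only, a MUST_ITERATE pragma stating the trip count. Whether this actually yields a pipelined loop still has to be checked in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

#define TAPS 15  /* assumed tap count for this example */

/* Hypothetical helper (not from the thread): dot product of TAPS terms.
 * restrict tells the compiler that x and h never alias, and the local
 * accumulator lets the running sum live in a register inside the loop. */
float dot_taps(const float *restrict x, const float *restrict h)
{
    float acc = 0.0f;
    int j;

    assert(x != NULL && h != NULL);
#ifdef _TMS320C6X
    /* TI-specific hint: the loop runs exactly 15 times. */
    #pragma MUST_ITERATE(15, 15)
#endif
    for (j = 0; j < TAPS; j++) {
        acc += x[j] * h[j];
    }
    return acc;
}
```

On a host compiler the pragma is skipped, so the same file can be used to check results before moving to the target.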

- Andrew

Regards,
Andrew Elder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AudioScience, Inc. (Rochester Branch)
274 N. Goodman Street, Suite B260A, Box 64,
Rochester, NY 14607
ph (1) (585) 271-8870
fax (1) (585) 271-5853
<www.audioscience.com>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~



Jagadeesh Sankaran wrote:

> Bernhard 'Gustl' Bauer wrote:
>
>> I have written C code with CCS 2.2 that takes about 20% too
>> much time :-( I have tried improving it by optimizing the data
>> structures, changing compiler options and following all the hints I
>> found in the lst files. If any of you knows some good
>> documentation .... :-)
>>
>> But I think I have to switch to assembler. I am thinking of taking
>> the CCS-generated asm file and trying to improve it. Has anybody
>> done this? How much speed did you gain? Any pointers to documents?
>>
>
> Can you share the code, or some snippets of it so that I can offer
> more productive advice.

Unfortunately my company wouldn't allow this for the complete code. But
I can show you a snippet:

unsigned int i,j,l;
unsigned int idx[6];      /* per-channel write position in fil */
float out_array[6][8];
float in_array[6][8];
float fil[6][16];         /* per-channel circular history */
/* coeff and coeff_h: declarations not shown */

for (l=0;l<8;l++) {
    coeff=coeff_h;
    for (i=0;i<6;i++) {
        out_array[i][l]=0;
        for (j=0;j<15;j++) {
            out_array[i][l]+=fil[i][(j+idx[i])&0xF]* *coeff++;
        }
        fil[i][idx[i]++]=in_array[i][l];
        idx[i]&=0xF;
    }
}

I have a lot of constructions like this: a loop in a loop in a loop with
2-dim arrays.

Gustl




Anand K wrote:
>
> I would try writing C with intrinsics first. TI intrinsics are very
> easy to incorporate into C code, and I have seen very good
> improvements. Check out "Optimizing C compilers Guide" - SPRU187L.pdf
> from dspvillage.ti.com

I will have a closer look at the intrinsics. I already have a few ideas
where they could be useful.

> If you have already done that, Mihir's suggestion would be next. You
> might also want to check out a post from Jagadeesh Sankaran on this
> board quite a while ago:
>
> http://groups.yahoo.com/group/c6x/message/1283

This looks very promising. I'll go over it.

Thanks a lot

Gustl



Anand K wrote:

> If you have already done that, Mihir's suggestion would be next. You
> might also want to check out a post from Jagadeesh Sankaran on this
> board quite a while ago:
>
> http://groups.yahoo.com/group/c6x/message/1283

I have just been over this and played with the compiler options.

Jagadeesh Sankaran wrote:
> Remember, using -g automatically slows down code, because not all
> the advanced optimizations can be done, and this slows one down by
> 10-15%.

So I immediately switched off debug mode! The result was that my IRQ
routine needed about 6% _more_ time than before! Can anybody explain this?

I have CCS 2.2, with these options set:
(-g) -k -q -s -al -os -o3 -mt -mw -ml3 -mv6710

Gustl




Bernhard,

You could try something like the following.
You may be able to convert to LDDW operations (via intrinsics) if array address
accesses allow it (every LDDW has to be on an 8 byte boundary). I have sometimes
re-organized some coeff arrays specifically for that purpose.

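for (l=0;l<8;l++) {
    coeff=coeff_h;
    for (i=0;i<6;i++) {
        float sum0=0;               /* two independent accumulators */
        float sum1=0;
        unsigned int idxx=idx[i];   /* read idx[i] once per channel */
        for (j=0;j<14;j+=2) {       /* taps 0..13, two per iteration */
            sum0+=fil[i][(j+idxx)&0xF] * *coeff++;
            sum1+=fil[i][(j+1+idxx)&0xF] * *coeff++;
        }
        sum0+=fil[i][(14+idxx)&0xF] * *coeff++;   /* tap 14, the 15th and last */
        out_array[i][l]=sum0+sum1;  /* assign, as the original zeroes out_array first */
        fil[i][idx[i]++]=in_array[i][l];
        idx[i]&=0xF;
    }
}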

BTW, I haven't tested the above. It may contain bugs....
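
Since a restructured loop like this is easy to get subtly wrong (tap count, = versus +=), one way to check it is to run it next to the straightforward version on identical data. The sketch below is a host-side check under stated assumptions: the names fir_ref, fir_unrolled and max_abs_diff are made up for the example, coeff_h is assumed to hold 6*15 floats, and the unrolled variant is written to cover exactly the 15 taps of the original inner loop. The test data are small integers so both summation orders are exact. With real data, two accumulators change the rounding order, so a small tolerance is the safer comparison.

```c
#include <assert.h>

#define CH   6   /* channels */
#define BLK  8   /* samples per block */
#define LEN  16  /* circular history length per channel */
#define TAPS 15  /* taps summed by the original inner loop */

/* Reference: same loop structure as the snippet posted in this thread. */
void fir_ref(float out[CH][BLK], float in[CH][BLK], float fil[CH][LEN],
             unsigned idx[CH], const float *coeff_h)
{
    unsigned i, j, l;
    for (l = 0; l < BLK; l++) {
        const float *coeff = coeff_h;
        for (i = 0; i < CH; i++) {
            assert(idx[i] < LEN);
            out[i][l] = 0;
            for (j = 0; j < TAPS; j++)
                out[i][l] += fil[i][(j + idx[i]) & 0xF] * *coeff++;
            fil[i][idx[i]++] = in[i][l];
            idx[i] &= 0xF;
        }
    }
}

/* Candidate: two accumulators, 7 unrolled pairs plus 1 leftover tap. */
void fir_unrolled(float out[CH][BLK], float in[CH][BLK], float fil[CH][LEN],
                  unsigned idx[CH], const float *coeff_h)
{
    unsigned i, j, l;
    for (l = 0; l < BLK; l++) {
        const float *coeff = coeff_h;
        for (i = 0; i < CH; i++) {
            float sum0 = 0, sum1 = 0;
            unsigned base = idx[i];
            assert(base < LEN);
            for (j = 0; j < TAPS - 1; j += 2) {
                sum0 += fil[i][(j + base) & 0xF] * *coeff++;
                sum1 += fil[i][(j + 1 + base) & 0xF] * *coeff++;
            }
            sum0 += fil[i][(TAPS - 1 + base) & 0xF] * *coeff++;
            out[i][l] = sum0 + sum1;
            fil[i][idx[i]++] = in[i][l];
            idx[i] &= 0xF;
        }
    }
}

/* Feeds both versions identical data; returns the largest |difference|. */
float max_abs_diff(void)
{
    float coeff[CH * TAPS];
    float in[CH][BLK], out_a[CH][BLK], out_b[CH][BLK];
    float fil_a[CH][LEN], fil_b[CH][LEN];
    unsigned idx_a[CH], idx_b[CH];
    unsigned i, k;
    float worst = 0.0f;

    /* Small integer test values, so every product and sum is exact. */
    for (k = 0; k < CH * TAPS; k++)
        coeff[k] = (float)((int)(k % 5) - 2);
    for (i = 0; i < CH; i++) {
        idx_a[i] = idx_b[i] = (3 * i) & 0xF;
        for (k = 0; k < LEN; k++)
            fil_a[i][k] = fil_b[i][k] = (float)((int)((i + 2 * k) % 7) - 3);
        for (k = 0; k < BLK; k++)
            in[i][k] = (float)((int)((5 * i + k) % 9) - 4);
    }

    fir_ref(out_a, in, fil_a, idx_a, coeff);
    fir_unrolled(out_b, in, fil_b, idx_b, coeff);

    for (i = 0; i < CH; i++) {
        for (k = 0; k < BLK; k++) {
            float d = out_a[i][k] - out_b[i][k];
            if (d < 0) d = -d;
            if (d > worst) worst = d;
        }
    }
    return worst;
}
```

Calling max_abs_diff() after every restructuring step gives a quick regression check before any cycle counts are measured on the DSP.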

- Andrew

Regards,
Andrew Elder
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
AudioScience, Inc. (Rochester Branch)
274 N. Goodman Street, Suite B260A, Box 64,
Rochester, NY 14607
ph (1) (585) 271-8870
fax (1) (585) 271-5853
<www.audioscience.com>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~





Hi there,

First of all, you need to figure out where the
bottleneck in your code is. Check the generated assembly
code to find out. Your code seems to be limited in several
ways, so I suppose a 20% speed-up should be no problem.

The limits include loads and stores, multiplication and
branch delay. What I can suggest:

1. Do what Andrew suggested.
Use LDDW and STDW to reduce the load/store pressure,
and check the assembly to make sure the TI compiler
keeps the loop result in a register instead of issuing
a store instruction in each iteration.

2. Transform your loop
The TI compiler is great at software pipelining, but
it performs quite poorly at exploiting instruction
parallelism and reducing branch delay, compared to the
TriMedia compiler. CCS 3.0 does a little better.
Check the assembly code to see whether software
pipelining is applied to your code. If it already is,
you don't have to try the following.
You could make the j loop the outer loop, move the i
loop inside it and unroll it completely.
Like:
for( l )
{
    for( j )
    {
        //i=0;
        sum0 += ..
        //i=1
        sum1 += ..
        //i=2
        sum2 += ..
        ...
    }
    // store sum0, sum1,...
}
Then you will have only a 2 level loop.

3. Change the 2-D arrays to 1-D arrays
Then increment the pointer after each loop iteration. It
should save some of the instructions that compute the
address. Check the assembly for the answer.

4. Use in_array directly
At a glance, it seems fil could be eliminated.

5. Try intrinsics
For example, EXTU may save the instructions needed to load 0xF.

6. Algorithm optimization
Check whether a fast algorithm, such as an FFT, can be used.

Good luck.
Quentin
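
To make suggestions 2 and 3 above more concrete, here is a minimal host-side sketch, not code from the thread. All names (fir_ref, fir_interchanged, interchange_matches) and the flat row-major layout (element [i][k] stored at i*width + k, coeff_h holding 6*15 floats) are assumptions for the example. The interchange relies on each channel i touching only its own part of fil and idx. The sketch uses explicit index arithmetic on 1-D arrays rather than incremented pointers, and it only shows that the transformed loop gives the same results; whether it is faster on the C67x still has to be read from the generated assembly.

```c
#include <assert.h>

enum { CH = 6, BLK = 8, LEN = 16, TAPS = 15 };

/* Original loop order on flat 1-D arrays (element [i][k] at i*width + k). */
void fir_ref(float *out, const float *in, float *fil, unsigned *idx,
             const float *coeff_h)
{
    unsigned i, j, l;
    for (l = 0; l < BLK; l++) {
        for (i = 0; i < CH; i++) {
            float acc = 0.0f;
            assert(idx[i] < LEN);
            for (j = 0; j < TAPS; j++)
                acc += fil[i * LEN + ((j + idx[i]) & 0xF)] * coeff_h[i * TAPS + j];
            out[i * BLK + l] = acc;
            fil[i * LEN + idx[i]] = in[i * BLK + l];
            idx[i] = (idx[i] + 1) & 0xF;
        }
    }
}

/* Interchanged: j is the tap loop, the six channels are unrolled by hand
 * inside it, so the deepest nesting is two loops instead of three. */
void fir_interchanged(float *out, const float *in, float *fil, unsigned *idx,
                      const float *coeff_h)
{
    unsigned i, j, l;
    for (l = 0; l < BLK; l++) {
        float s[CH] = { 0 };  /* one accumulator per channel */
        for (j = 0; j < TAPS; j++) {
            s[0] += fil[0 * LEN + ((j + idx[0]) & 0xF)] * coeff_h[0 * TAPS + j];
            s[1] += fil[1 * LEN + ((j + idx[1]) & 0xF)] * coeff_h[1 * TAPS + j];
            s[2] += fil[2 * LEN + ((j + idx[2]) & 0xF)] * coeff_h[2 * TAPS + j];
            s[3] += fil[3 * LEN + ((j + idx[3]) & 0xF)] * coeff_h[3 * TAPS + j];
            s[4] += fil[4 * LEN + ((j + idx[4]) & 0xF)] * coeff_h[4 * TAPS + j];
            s[5] += fil[5 * LEN + ((j + idx[5]) & 0xF)] * coeff_h[5 * TAPS + j];
        }
        for (i = 0; i < CH; i++) {  /* store the sums, push the new samples */
            out[i * BLK + l] = s[i];
            fil[i * LEN + idx[i]] = in[i * BLK + l];
            idx[i] = (idx[i] + 1) & 0xF;
        }
    }
}

/* Returns 1 when both versions produce identical outputs on test data. */
int interchange_matches(void)
{
    float coeff[CH * TAPS], in[CH * BLK], out_a[CH * BLK], out_b[CH * BLK];
    float fil_a[CH * LEN], fil_b[CH * LEN];
    unsigned idx_a[CH], idx_b[CH], k;

    /* Small integer values keep every product and sum exact in float. */
    for (k = 0; k < CH * TAPS; k++) coeff[k] = (float)((int)(k % 5) - 2);
    for (k = 0; k < CH * BLK; k++)  in[k] = (float)((int)(k % 9) - 4);
    for (k = 0; k < CH * LEN; k++)  fil_a[k] = fil_b[k] = (float)((int)(k % 7) - 3);
    for (k = 0; k < CH; k++)        idx_a[k] = idx_b[k] = (3 * k) & 0xF;

    fir_ref(out_a, in, fil_a, idx_a, coeff);
    fir_interchanged(out_b, in, fil_b, idx_b, coeff);

    for (k = 0; k < CH * BLK; k++)
        if (out_a[k] != out_b[k]) return 0;
    return 1;
}
```

Each channel still adds its taps in the same order as before, so the outputs should match the reference exactly and no tolerance is needed in the comparison.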
