DSPRelated.com
Forums

C-Coding

Started by raja nayaka November 14, 2002
Hello C6x Pals
Thanks for the response. I would like to share my view about C coding.

Once the development of any project starts off in C, using the CSL library, BIOS, and the luxury of standard ANSI C coding, it becomes very difficult to come back and redefine your code in hand-optimized assembly. So I believe that one should rather start directly with assembly coding, so that he gets used to it and its difficulty at the beginning itself. It's my personal opinion, and I myself am a victim of it.

I would like to know how people around the world plan their development: is C their first choice, or assembly? I would like to thank all who responded to my earlier mail.
BABURAO RANE



Raja Nayaka-

> Once the development of any projects starts of in C using the CSL
> library , Bios and the standard ASCI C luxury of coding , it
> becomes very difficult to come back and redefine your code in hand
> optimized assembly.

Except for possible speed/memory usage issues with DSP/BIOS, it's usually a good idea to use whatever CCS objects are convenient and will get your project working faster. The TI tools are very powerful and can save a lot of time, plus you can get help on newsgroups like this one because the TI tools are widely used.

But if you have performance issues, then the next step is to start optimizing those sections of your C code which are slow: for example, sections that have loops, multiple arithmetic statements, or calls into library functions. Once you have the ability to call back and forth between C and asm (which is really not that hard), then it's no problem to do this.
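As a minimal sketch of that workflow: in a real project, a routine like the one below might live in a hand-optimized .asm file following the C calling convention, with only the extern prototype visible from C. The name and signature here are my own illustration, not from the post; the portable C body stands in for the asm version and doubles as a reference for bit-exact comparison.

```c
#include <assert.h>

/* Hypothetical C-callable routine. On the DSP, the definition could be
 * replaced by a hand-optimized assembly version with the same signature;
 * C callers would see only this prototype. */
extern int dot_product(const short *x, const short *y, int n);

/* Portable C reference, standing in for the asm implementation and usable
 * for bit-exact comparison against it. */
int dot_product(const short *x, const short *y, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```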

In some projects we have only a few asm-language coding sections, usually some specific functions. In other projects we have many such functions, and the C code starts to become like a "shell" or wrapper, existing only to maintain compatibility with CCS objects and libraries. The time spent learning how to do this will be well worth it.

Jeff Brower
DSP sw/hw engineer
Signalogic



For a while now, I have been eagerly watching the experiences that people have had with C versus assembly language coding. While I can relate to most of the experiences that people have expressed in their mail, I would like to take a different angle on this subject.

The compiler is a fairly complicated piece of software, and expecting it to handle all situations is not realistic. That said, loops which have a regular trip count or a known maximum number of iterations are more suitable for compiler optimization. Remember, the primary emphasis of the compiler is correctness, followed by performance. How many times have you or I written hand code that is totally incorrect, or that clobbers a register? Generation of incorrect code by a compiler would be a nuisance and unacceptable to many.

In general, giving more information to the compiler by way of _nassert's, restricts, consts, pointer alignment, and pragmas that request inner loop unrolling can make a significant difference. Further, in many cases one does not take the time to add this information and to rewrite the C code (still target-independent) using more advanced loop optimizations. These can be inner/outer loop unrolling, loop fusion/coalescing, or loop order interchange. The list is fairly extensive, but the skill is in identifying which one matches the problem at hand.
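To make one item on that list concrete, here is an illustrative sketch of loop fusion in plain C (the kernel and names are my own example, not from the post): two simple passes are merged into one loop, giving the software pipeliner a single larger loop body and halving the memory traffic on the output array.

```c
#include <assert.h>
#include <stddef.h>

/* Two passes over the data: each loop is trivial, but out[] is written
 * and re-read. */
void scale_offset_two_pass(float *restrict out, const float *restrict in,
                           float s, float b, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = in[i] * s;
    for (size_t i = 0; i < n; i++) out[i] = out[i] + b;
}

/* Fused version: one loop with more work per iteration, which the
 * software pipeliner can schedule as a single kernel. */
void scale_offset_fused(float *restrict out, const float *restrict in,
                        float s, float b, size_t n)
{
    for (size_t i = 0; i < n; i++) out[i] = in[i] * s + b;
}
```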

Yes, it would be nice if the compiler figured it all out; in fact, it would be even better if the compiler generated 100% optimized code from Simulink block diagrams in Matlab. The fact of the matter is that this is not going to happen any time soon. Given this, I have personally seen speedups of 3-5x by merely rewriting the same code in a more compiler-friendly manner in C, keeping the architecture in mind. This is a skill that one develops over time, by interacting with the compiler. Coupling these loop optimizations with target-specific instructions using intrinsics can yield another 2-3x. In fact, one good rule of thumb is to make sure that your final C code with intrinsics is an exact match to the assembly code one would have written. This is possible in almost all cases, as there are intrinsics for all target instructions that do not have a natural form in ANSI C; the only exception is hardware circular-buffer support from C. This approach has helped me cut down my development time, while still maintaining reasonable levels of performance, 75-90% of my hand code. There have been a few times where it has outwitted my hand code and I have had a tough time even catching up with it. One of the most important things to factor into this is the set of flags that people use:

optimization level: -o2 or -o3
flags like:

-mw: print information about the software pipeliner
-mt: assume no bad memory aliases
-mx: try multiple scheduling algorithms
-mh: perform speculative loads
-oi N: auto-inlining threshold
-mi N: interrupt threshold

Remember that using -g automatically slows code down, because not all the advanced optimizations can be done; this costs about 10-15%. If the performance from intrinsic C code is not satisfactory, my personal choice would be the assembly optimizer. Here one does not have to worry about the pipeline; one can choose the exact instructions that one thinks will allow for optimum use of the eight functional units and let the assembly optimizer do the rest. Using the -mw flag with the -k flag produces an assembly code listing that goes into great detail about how the compiler performed and what prevented it from getting an optimum schedule. Using this information, one can rewrite the code in a more efficient way. If the scheduling algorithm that the assembly optimizer is performing results in a lop-sided schedule, then one can further control the scheduling process by appending .1's and .2's to the instructions, to guide it better. With SA (serial assembly) for regular trip-count loops, my personal experience has been 80-95% of hand-coded performance in 1/5th the development time, without worrying about mundane issues like pipelining, register allocation bugs, etc. The .cproc directive automatically makes it C-callable. Further, one can call a library function from serial assembly using the .call directive (which can be useful).

Having code in SA allows one to trade off performance and code size using different compiler flags. It allows one to reap the benefits of newer compilers. Hand assembly is a sitting target that is never going to evolve, and it is hard to maintain. It makes a software group dependent on the coding skills of one or two specific people. Rather, keeping code in SA allows others to study the code in a pipeline-independent manner. In the future, when a new architecture is available, one can even envision some kind of serial assembly translator from one architecture to the next. It preserves software investment both short term and long term. Hence, I personally am not for hand-optimized coding, given all these points against it. Even if I had to develop hand code for the 10-15% of loops that the tools messed up on, I would still develop all the versions of the code I have referred to:

natural C code: textbook implementation
optimized C code: C code with advanced loop-level optimizations and pragmas
intrinsic C code: C code with intrinsics
serial assembly code: linear sequence of assembly instructions
partitioned serial assembly code: code with .1's and .2's to guide the optimizer
hand code: if needed
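As an illustration of the "intrinsic C" level, here is a portable stand-in for the C6x _sadd intrinsic (saturating 32-bit add); the emulation below is my own sketch of its semantics. On the DSP, intrinsic C code would call _sadd directly and the compiler would map it to a single SADD instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Portable emulation of the C6x _sadd intrinsic: add two 32-bit values,
 * saturating at the int32_t limits instead of wrapping around. */
int32_t sadd_portable(int32_t a, int32_t b)
{
    int64_t r = (int64_t)a + (int64_t)b;   /* widen so the sum cannot overflow */
    if (r > INT32_MAX) return INT32_MAX;
    if (r < INT32_MIN) return INT32_MIN;
    return (int32_t)r;
}
```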

This will truly allow one to discern how far the compiler is off from the final performance. I am confident that this systematic approach will yield rich dividends. Remember that the performance of a VLIW architecture is critically dependent on two things:

a) the compiler being able to generate good code;
b) the ability of the user providing the code to the compiler to tailor it in a compiler-friendly manner.

When both (a) and (b) are present the magic happens.

Happy optimizing.

Regards
Jagadeesh Sankaran


Jagadeesh Sankaran-

All excellent suggestions, very good and very thoughtful. Clearly you are an
experienced TI DSP developer!

But I would add a comment. From hard experience with the TI tools, we use this rule here in our labs: always get it working first with ALL optimizations turned OFF. Then we go back and do many of the things you suggest, one module (indeed, one code section) at a time. I.e., very very carefully, with a lot of intermediate testing and bit-exact comparisons to make sure we are not fooling ourselves that it's still working when it's not.

Mostly, our previously time-consuming run-ins with the tools were due to hardware-related issues and early silicon device issues, so this suggestion may apply less to developers doing pure software/algorithm development, without the need to run on non-DSK hardware.

Jeff Brower
DSP sw/hw engineer
Signalogic




Hardware issues can be tough. This is why having something on which you can instrument is a better option. Even on a serial assembly file which produces the best schedule, one can turn off all optimizations by using:

-g -o0 -mu

This prevents any loop optimizations; software pipelining is turned off. It will immediately cause the same code that generated the optimal schedule to lose performance, and it enables line-by-line debugging. Once issues are resolved, turn back on the full set of flags, for example:

-k -o2 -mwtx -mh -mi -oi1024

and you can automagically get back your performance.

This is something that can never be done with a fixed lump of hand assembly code that someone else developed, with sparse documentation.

I agree with you on having unit-level testing for bit-exactness. In general it is better to have two testing flows: one with random vectors, and another with known deterministic vectors. More and more, as we develop projects with time and cost being crucial crunch factors, automatic code generation and the ability to exploit it hold the keys to success. I would like to thank you for your comments on my previous e-mail. I just thought it would be better to share my view that embedded software has for too long given up on a systematic software vision in order to get performance. My opinion is that this need not be the case, and I am glad to see that you share my opinion as well.
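A minimal sketch of such a two-flow, bit-exact harness in plain C (the kernels, names, and vector sizes are illustrative): a reference and an "optimized" implementation are run over pseudo-random vectors from a deterministic seed, so every run is reproducible, and the outputs are compared element by element.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*kernel_fn)(int *out, const int *in, int n);

/* Reference: textbook scale-by-3. */
static void ref_scale(int *out, const int *in, int n)
{
    for (int i = 0; i < n; i++) out[i] = in[i] * 3;
}

/* "Optimized" variant: same arithmetic, loop unrolled by two. */
static void opt_scale(int *out, const int *in, int n)
{
    int i = 0;
    for (; i + 1 < n; i += 2) {
        out[i]     = in[i] * 3;
        out[i + 1] = in[i + 1] * 3;
    }
    for (; i < n; i++) out[i] = in[i] * 3;
}

/* Returns 1 if ref and opt are bit-exact over `trials` random vectors
 * generated from a fixed seed (deterministic, hence reproducible). */
int bit_exact(kernel_fn ref, kernel_fn opt, unsigned seed, int trials)
{
    int in[64], a[64], b[64];
    srand(seed);
    for (int t = 0; t < trials; t++) {
        for (int i = 0; i < 64; i++) in[i] = rand() % 1024 - 512;
        ref(a, in, 64);
        opt(b, in, 64);
        for (int i = 0; i < 64; i++)
            if (a[i] != b[i]) return 0;
    }
    return 1;
}
```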

Regards
Jagadeesh Sankaran




Jagadeesh and Jeff,

Thanks for the posts - that was a very methodical approach to the process, Jagadeesh; I'm saving it for future use. I have a couple more questions that I thought I should throw in while the iron is hot:

1. I particularly have a problem when there is heavy register usage, including heavy conditional register usage - when it is tough to make the compiler understand the way we want to do certain things. For instance, when there is a long-latency instruction writing to a register, I would like to use that register in the intervening cycles - e.g. an LDx src,dst instruction writes to dst, but the actual write happens only 4 cycles later - so I could use the register for other purposes such as buffering in the meantime. But the compiler never seems to do this even if we write linear asm - it continues to complain 'register live too long' and increases ii. Is there some flag that I'm missing?!

2. The other thing about mastering compiler-friendly C coding is that - isn't this vendor specific? I have not worked with many DSPs, so I don't know! This is one reason why I don't mind the slightly longer time taken for hand-coded assembly, rather than delving deeper into compiler eccentricities - IMO, the learning time for both is nearly the same, with better returns from learning hand-coded assembly - I have seen that there are heavily methodical approaches to hand-coded assembly too. My question, hence, is whether compilers across vendors have similar (if not the same) rules for producing optimized VLIW code.

Looking forward to your comments,

TIA
ka



Jagadeesh Sankaran-

> Hardware issues can be tough. This is why having something on which you can
> instrument is a better option.

By "instrument" do you mean modifying the source? If so, that's unacceptable in our work. We cannot debug or test performance-sensitive, real-time source code with any insertions into the code. Debugging must be by standard JTAG or HPI.

Jeff Brower
DSP sw/hw engineer
Signalogic





>
>1. I particularly have a problem when there is heavy register usage,
>including heavy conditional register usage - when it is tough to make
>the compiler understand the way we want to do certain things. For
>instance, when there is a long-latency instruction writing to a
>register, I would like to use that register in the intermittent
>cycles - e.g. LDx src,dst instruction writes to dst, the actual write
>is going to happen only 4 cycles from there - so I can use it for
>other purposes such as buffering in the meantime. But the compiler
>never seems to do this even if we write linear asm - it continues to
>complain 'register live too long' and increases ii. Is there some
>flag that Im missing ?!

Live-too-longs are interesting problems in VLIW scheduling. A live-too-long can arise in two different scenarios:

a) When the value of the variable for the next iteration has already been computed, but the variable cannot be updated, because its current value from this iteration is still being used much later.

b) When the value of a variable in any iteration is set through one of multiple paths of different lengths.

A simple example of (a) is to read a location, update the value, and write it back using the same pointer. This prevents successive values from being loaded, because of pointer dependencies. A simple way to break live-too-longs is register renaming: copy the variable into a temporary and use the temporary copy going forward. The compiler does this most of the time, both in C code and in serial assembly. When it cannot automatically figure it out, some manual intervention in doing this helps.
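The renaming idea can be sketched in plain C (kernel and names are my own illustration): each iteration gives the loaded value a fresh name and carries it forward in a copy, instead of reloading through the pointer, which is the pattern that breaks the load/use dependence the scheduler flags as live-too-long.

```c
#include <assert.h>

/* Pairwise sum where each iteration reuses the previous element.
 * Copying the value forward (cur -> prev) under a fresh name, instead of
 * reloading in[i-1], removes the serial load dependence. The cycle-level
 * benefit shows up in the C6x schedule, not in portable C semantics. */
void pairwise_sum(int *restrict out, const int *restrict in, int n)
{
    int prev = in[0];
    for (int i = 1; i < n; i++) {
        int cur = in[i];       /* fresh name each iteration */
        out[i - 1] = prev + cur;
        prev = cur;            /* rename: carry the value, don't reload */
    }
}
```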

Always load values speculatively and use the loaded values if needed, rather than predicating the load itself. This does require one to take care that one is not overstepping the array.

e.g.:

if (i > 32) i = (j*3);
else i = table[i];

This code can be rewritten to read safely within the array of 32 possible values as follows:

index_sp = (i & 31); // keep index modulo 32
value_sp = table[index_sp];
flag_sp = (i <= 32); // evaluate the condition before i is overwritten
i = (j * 3);
if (flag_sp) i = value_sp;

This allows the conditional load to be issued way early, so that the latency of the load pipeline can be hidden.
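Here is a self-contained sketch of the same speculative-load idea in plain C, checking that the branchy and speculative forms agree (function names are illustrative; the & 31 mask is applied in both forms, and the predicate is saved before the variable is reused, so the two match for any non-negative input):

```c
#include <assert.h>

/* Original branchy form: the table load is predicated on the condition. */
int select_direct(int i, int j, const int table[32])
{
    if (i > 32) return j * 3;
    else        return table[i & 31];   /* mask keeps the read in bounds */
}

/* Speculative form: always perform the (in-bounds) load, save the
 * predicate up front, then select. The unconditional load can be issued
 * early to hide the load pipeline latency. */
int select_speculative(int i, int j, const int table[32])
{
    int value_sp = table[i & 31];   /* speculative load, always in bounds */
    int flag_sp  = (i <= 32);       /* predicate saved before i is reused */
    int r = j * 3;
    if (flag_sp) r = value_sp;
    return r;
}
```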

Using the -mx flag will automatically try to avoid live-too-longs. Try using fresh names for temporary variables, as opposed to reusing the same name; this avoids creating fictitious name dependencies. When the compiler complains about a live-too-long, take a look at the scheduled code and you will see which source lines the compiler is not able to resolve, by noting the (^) marked against them in the pipeline listing. Try to see if you can rewrite them. Avoid writing hand code; it is a fixed lump of code that is a significant drain on one's time and resources. Besides, you can try 100 different things in the same time that it takes to develop hand code.

On the C62x there are only 32 registers, so register pressure is definitely possible. On the C64x, with 64 registers, the possibility is remote. While coding on the C62x, try to keep the loop unroll amounts reasonable. In highly conditional code, try to use the multiplier as well to issue some of the conditional operations:

if ( i > 0) j += 32;

flag = ( i > 0);
j = j + (flag * 32);

This code sequence just saved the use of a conditional register. There are many such tricks; the more you try, the more you will develop your own.
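The equivalence of the two forms is easy to check in plain C (function names are my own):

```c
#include <assert.h>

/* Branchy form: the add is guarded by a condition, which on the C6x ties
 * up one of the condition registers. */
int add32_branch(int i, int j)
{
    if (i > 0) j += 32;
    return j;
}

/* Multiply form: the comparison yields 0 or 1, so the guarded add becomes
 * unconditional arithmetic and no condition register is needed. */
int add32_multiply(int i, int j)
{
    int flag = (i > 0);      /* 0 or 1 */
    return j + flag * 32;
}
```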
>2. The other thing about mastering compiler friendly C-coding is
>that - isnt this vendor specific ? I have not worked with many DSPs,
>so I dont know ! This is one reason why I dont care about slightly
>longer time taken for hand-coded assembly than delving deeper into
>compiler eccentricities - IMO, the learning time for both are nearly
>the same, and better returns with learning handcoded assembly - I
>have seen that there are heavily methodical approaches to handcoded
>assembly also. My question hence, is whether compilers across vendors
>have similar (if not same) rules for producing optimized VLIW code.

I would say it is more architecture-specific. Fortunately, most DSP processors today are VLIW devices, so there is a lot of commonality. Hand-coded assembly is prone to errors and maintenance difficulties. On the C6x you have to take care to turn off interrupts when you are not in single register assignment mode. You have to take care in maintaining the C calling convention. You have to take care that you are doing stack maintenance correctly. It is so prone to error that it is best avoided. While the individual flags differ from vendor to vendor, similarities do exist in what is compiler-friendly and what is not.

I am not trying to deprecate the use of hand-optimized coding. I have written hand code on several occasions; I just feel that this is not the right path to developing reliable and safe software. It only gets harder writing hand code going from the C62x to the C64x, with SIMD and VLIW together. The compiler and assembly optimizer are, in my opinion, your best bets.

Regards
Jagadeesh Sankaran


Akalya-

> 2. The other thing about mastering compiler friendly C-coding is
> that - isnt this vendor specific ?

Of course. But whose other 32-bit DSP are you going to use? Maybe SHARC... not Starcore. Does Motorola even have one? Unforeseen factors, fate, even some luck, and no small amount of hard work and dedication by TI have left them the only DSP vendor still standing who offers the whole package -- a wide range of fixed-point and floating-point solutions, a detailed roadmap, and extremely dense, low-power, small packages. And of course, excellent development tools.

Jeff Brower
DSP sw/hw engineer
Signalogic








Mr. Jagadeesh and Mr. Jeff Brower,

If I had a little more time and patience, I perhaps would have collected this entire thread on optimizations, edited it in a more logical way, and maybe made it available as a PDF in the public domain! (And I still might...)

I have to say, it was a vastly enlightening discussion... though it was something that I couldn't read and digest in one day, because some of the things described I may not have actually experienced at all.

Great discussion, thank you guys! (Bhooshan bows!)

I have a question here. I was going through an optimization recommendation made by an Israeli professor on C code, which seems apparently buggy/incorrect to me. But it seems a little hard to believe that a man trying to teach optimization could make such silly mistakes, so I want your help in understanding whether he is right (if he is, then I don't understand what he is doing...). I am posting the slide below, and then my question...

How To Write Better C Code for DSPs

Use Simple Loops:

not so good:
-------------------------------
for (k = N; k > 0; k--)
{
    res[2*k]   = x[2*k]*y[2*k]   + x[2*k+1]*y[2*k+1];
    res[2*k+1] = x[2*k+1]*y[2*k] - x[2*k]*y[2*k+1];
}
--------------------

better:
--------------------
for (k = N; k > 0; k--)
    res[2*k] = x[2*k] + y[2*k];
for (k = N; k > 0; k--)
    res[2*k] += x[2*k+1] + y[2*k+1];
for (k = N; k > 0; k--)
    res[2*k-1] = x[2*k+1] + y[2*k];
for (k = N; k > 0; k--)
    res[2*k-1] -= x[2*k]*y[2*k+1];
--------------------

The second part of the code does not match the first in correctness! And I can't figure out for my life how it could be the same. As far as the optimization is concerned, maybe someone could throw light on why the second method is better optimized than the first one, Jeff, Jagadeesh? (Is it FFT code, by the way? A sum-and-difference equation?)
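[For comparison, here is my own reconstruction of what the slide's "better" version would presumably look like if it preserved the original arithmetic: the slide uses '+' in places where the fused loop has '*', and writes res[2*k-1] where the original writes res[2*k+1]. With those fixed, the four simple loops compute exactly the same complex-multiply-style result as the one fused loop. This is a guess at the slide's intent, not the professor's code.]

```c
#include <assert.h>

/* Original fused loop from the slide (real part in res[2k], imaginary
 * part in res[2k+1]). Indices k = 1..N are used, so the arrays must hold
 * at least 2*N + 2 elements. */
void cplx_mul_fused(int *res, const int *x, const int *y, int N)
{
    for (int k = N; k > 0; k--) {
        res[2*k]   = x[2*k]*y[2*k]   + x[2*k+1]*y[2*k+1];
        res[2*k+1] = x[2*k+1]*y[2*k] - x[2*k]*y[2*k+1];
    }
}

/* Reconstructed "simple loops" split: same arithmetic spread over four
 * single-operation loops, each easy for the compiler to pipeline. */
void cplx_mul_split(int *res, const int *x, const int *y, int N)
{
    for (int k = N; k > 0; k--) res[2*k]    = x[2*k]*y[2*k];
    for (int k = N; k > 0; k--) res[2*k]   += x[2*k+1]*y[2*k+1];
    for (int k = N; k > 0; k--) res[2*k+1]  = x[2*k+1]*y[2*k];
    for (int k = N; k > 0; k--) res[2*k+1] -= x[2*k]*y[2*k+1];
}
```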

And for the benefit of the many other freshers, I am attaching the .ppt made by Prof. Assaf Kasher from the Technion (I think!), which talks about DSP optimization, along with this mail. Enjoy!

 

Regards,

         Bhooshan




Attachment (not stored)
lecture4nn.ppt
Type: application/vnd.ms-powerpoint