
Technical discussions about the TI C6000 DSPs (including the c62x, c64x and c67x DSPs).

  


C-Coding - raja nayaka - Nov 14 10:43:00 2002

Hello C6x Pals
Thanks for the responses. I would like to share my view about C coding.

Once the development of a project starts off in C, using the CSL library,
BIOS and the luxury of standard ANSI C coding, it becomes very difficult to
come back and rewrite your code in hand-optimized assembly.
So I believe that one should rather start directly with assembly coding, so
that one gets used to it and its difficulty from the very beginning.
This is my personal opinion, and I myself am a victim of it.
I would like to know how people around the world plan their development: is
C their first choice, or assembly? I thank all who responded to my earlier mail.
BABURAO RANE






(You need to be a member of c6x -- send a blank email to c6x-subscribe@yahoogroups.com )

Re: C-Coding - Jeff Brower - Nov 14 14:11:00 2002

Raja Nayaka-

> Once the development of any projects starts of in C using the CSL
> library , Bios and the standard ASCI C luxury of coding , it
> becomes very difficult to come back and redefine your code in hand
> optimized assembly.

Except for possible speed/memory usage issues with DSP/BIOS, it's usually a good
idea to use whatever CCS objects are convenient and will get your project working
faster. The TI tools are very powerful and can save a lot of time, plus you can
get help on news groups like this because the TI tools are widely used.

But if you have performance issues, then the next step is to start optimizing
those sections of your C code which are slow; for example, have loops, multiple
arithmetic statements, make calls into library functions, etc. Once you have the
ability to call back and forth between C and asm (which is really not that hard),
then it's no problem to do this.

In some projects we have only a few asm lang coding sections, usually some
specific functions. In other projects we have many such functions and the C code
starts to become like a "shell" or wrapper, existing only to maintain
compatibility with CCS objects and libraries. The time spent learning how to do
this will be well worth it.

Jeff Brower
DSP sw/hw engineer
Signalogic




C-Coding - Jagadeesh Sankaran - Nov 14 15:17:00 2002

For a while now, I have been watching eagerly the experiences that
people have had between C and assembly language coding. While I can
relate to most of the experiences that people have expressed in their
mail, I would like to take a different angle on this subject.

The compiler is a fairly complicated piece of software, and expecting
it to handle all situations is not realistic. That said, loops which
have a regular trip count or a known maximum number of iterations are more
suitable for compiler optimization. Remember that the primary emphasis
of the compiler is correctness, followed by performance. How many times
have you or I written hand code that is totally incorrect, or that messes
up a register? Generation of incorrect code by a compiler would be a
nuisance and unacceptable to many.

In general, giving more information to the compiler by way of _nassert's,
restricts, consts, alignment of pointers, and pragmas that request inner loop
unrolling can make a significant difference. Further, in many cases one does
not take the time to add this information and re-write the C code (still
target independent) using more advanced loop optimizations. These can be
inner/outer loop unrolling, loop fusion/coalescing, and loop order
interchange. The list is fairly extensive, but the skill is in identifying
which one matches the problem at hand.
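
As a rough illustration of the hints listed above (a sketch, not code from the thread): the function name dot_q15 and its contract (n at least 8, a multiple of 8, 8-byte aligned non-overlapping inputs) are assumptions made up for this example. _nassert and MUST_ITERATE are TI compiler features, so they are guarded by the TI-predefined _TMS320C6X macro and drop out on a host compiler.

```c
#include <assert.h>

/* TI-only alignment hint; compiles to nothing on other compilers. */
#ifdef _TMS320C6X
#define ASSUME_ALIGNED8(p) _nassert(((unsigned int)(p) & 7u) == 0)
#else
#define ASSUME_ALIGNED8(p) ((void)0)
#endif

/* Hypothetical dot product of two 16-bit vectors.
 * Assumed contract: n >= 8, n is a multiple of 8, x and y are 8-byte
 * aligned and do not overlap (which is what 'restrict' promises). */
int dot_q15(const short *restrict x, const short *restrict y, int n)
{
    int i;
    int sum = 0;

    assert(n >= 8 && (n % 8) == 0);   /* documents the trip-count contract */
    ASSUME_ALIGNED8(x);
    ASSUME_ALIGNED8(y);

#ifdef _TMS320C6X
#pragma MUST_ITERATE(8, , 8)          /* min trip count 8, multiple of 8 */
#endif
    for (i = 0; i < n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

The C source stays target independent: the loop body is plain ANSI C, and only the hints tell the compiler what it may assume about trip count, alignment and aliasing.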

Yeah, it would be nice if the compiler figured it all out; in fact
it would be even better if the compiler generated 100% optimized code
from Simulink block diagrams from Matlab. The fact of the matter is
that this is not going to happen any time soon. Given this, I have
personally seen speedups of 3-5x by merely re-writing the same code
in a more compiler-friendly manner in C, keeping the architecture in
mind. This is a skill that one develops over time, by interacting with
the compiler. Coupling these loop optimizations with target-specific
instructions using intrinsics can yield another 2-3x. In fact, one good
rule of thumb is to make sure that your final C code with intrinsics
is an exact match to the assembly code one has written. This is
possible in all cases, as there are intrinsics for all target instructions
that do not have a natural form in ANSI C, the only exception being
circular buffer support in hardware from C. This approach has helped
me cut down my development time, while still maintaining reasonable
levels of performance, from 75%-90% of my hand code. There have been
(a few) times where it has outwitted my hand code and I have had a
tough time even catching up with it. One of the most important
things to factor into this is the set of flags that people use:

optimization level: -o2 or -o3
flags like:

-mw: print information about software pipeliner
-mt: assume no bad memory aliases
-mx: try multiple scheduling algorithms
-mh: perform speculative loads
-oi N: auto inlining threshold
-mi N: interrupt threshold.
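
Coming back to the intrinsics point made before the flag list, here is a minimal sketch of what "C code with intrinsics" can look like. The names sat_add32 and sat_sum are invented for this example; _sadd is the TI C6000 saturating-add intrinsic, and the plain C fallback is only there so the same file builds and behaves the same way on a host compiler (32-bit int assumed).

```c
#include <limits.h>

/* Saturating 32-bit add. With the TI C6000 compiler (_TMS320C6X defined)
 * this uses the _sadd intrinsic; elsewhere a portable C fallback gives
 * the same result. */
int sat_add32(int a, int b)
{
#ifdef _TMS320C6X
    return _sadd(a, b);
#else
    if (b > 0 && a > INT_MAX - b) return INT_MAX;  /* would overflow upward */
    if (b < 0 && a < INT_MIN - b) return INT_MIN;  /* would overflow downward */
    return a + b;
#endif
}

/* Accumulate a vector with saturation instead of wrap-around. */
int sat_sum(const int *x, int n)
{
    int i;
    int acc = 0;

    for (i = 0; i < n; i++)
        acc = sat_add32(acc, x[i]);
    return acc;
}
```

Keeping the fallback next to the intrinsic also gives a reference to test against, which fits the bit-exact checking discussed later in the thread.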

Remember that using -g automatically slows down code, because not all
the advanced optimizations can be done, and this slows one down by 10-15%.
If the performance from intrinsic C code is not satisfactory, my personal
choice would be the assembly optimizer. Here one does not have to worry
about the pipeline; one can choose the exact instructions that one thinks
will allow for optimum use of the 8 units and let the assembly optimizer
do the rest. Using the -mw flag with the -k flag produces an assembly code
listing that goes into great detail about how the compiler performed and
what prevented it from getting an optimum schedule. Using this information
one can re-write the code in a more efficient way. If the scheduling
algorithm that the assembly optimizer is performing results in
a lop-sided schedule, then one can further control the scheduling
process by adding .1's and .2's to the instructions, to
guide it better. With SA (serial assembly) for regular trip count loops, my
personal experience has been 80-95% of the hand-coded performance in 1/5th
the development time, without worrying about mundane issues like
pipelining, register allocation bugs etc. The .cproc directive
automatically makes it C callable. Further, one can call a library function
from serial assembly using the .call directive (which can be useful).

Having code in SA allows one to trade off performance and code size using
different compiler flags. It allows one to reap the benefits of newer
compilers. Hand assembly is a sitting target that is never going to evolve
and is hard to maintain. It makes a software group dependent on the coding
skills of one or two specific people. Keeping code in SA instead allows
others to study the code in a pipeline-independent manner. In future,
when a new architecture is available, one can even envision some kind of a
serial assembly translator from the old architecture to the new one. It
preserves software investment both short term and long term. Hence I
personally am not for hand-optimized coding, given all these points against
it. Even if I had to develop hand code for the 10-15% of the loops that the
tools messed up on, I would develop all the versions of the code I have
referred to:

natural C code: text book implementation
optimized C code: C code with advanced loop level optimizations and pragma's
intrinsic C code: C code with intrinsics
Serial assembly code: Linear sequence of assembly instructions
Partitioned Serial assembly code: Code with .1's and .2's to guide optimizer
Hand code: If needed.

This will truly allow one to discern how far the compiler is off from the
final performance. I am confident that this systematic approach will yield
rich dividends. Remember, that the performance of a VLIW architecture is
critically dependent on two things:

a) compiler being able to generate good code;
b) ability of the user providing the code to the compiler to tailor
the code in a compiler friendly manner.

When both (a) and (b) are present the magic happens.

Happy optimizing.

Regards
Jagadeesh Sankaran



Re: C-Coding - Jeff Brower - Nov 15 3:09:00 2002

Jagadeesh Sankaran-

All excellent suggestions, very good and very thoughtful. Clearly you are an
experienced TI DSP developer!

But, I would add a comment. From hard experience with the TI tools, we use this
rule here in our labs: always get it working first with ALL optimizations turned
OFF. Then we go back and do many of the things you suggest, one module (indeed
one code section) at a time. I.e. very very carefully, with a lot of intermediate
testing and bit-exact comparisons to make sure we are not fooling ourselves that
it's still working when it's not.

Mostly, our previously time-consuming run-ins with the tools were due to
hardware-related issues, and early silicon device issues, so this suggestion may
apply less to developers doing just pure software/algorithm development,
without need to run on non-DSK hardware.

Jeff Brower
DSP sw/hw engineer
Signalogic









Re: C-Coding - Jagadeesh Sankaran - Nov 15 16:39:00 2002

Hardware issues can be tough. This is why having something on which you can
instrument is a better option. Even on a serial assembly file, which produces
the best schedule, one can turn off all optimizations by using:

-g -o0 -mu

This prevents any loop optimizations; software pipelining is turned off.
This will immediately cause the same code that generated the optimal
schedule to lose performance, and it enables line by line debugging. Once
issues are resolved, turn back on the full set of flags, for e.g.:

-k -o2 -mwtx -mh -mi -oi1024

and you can automagically get back your performance.

This is something that can never be done with a fixed lump of hand
assembly code that someone else developed, with sparse documentation.

I agree with you on having unit-level testing for bit-exactness. In
general it is better to have two testing flows, one with random vectors
and another with known deterministic vectors. More and more,
as we develop projects with time and cost being crucial crunch factors,
automatic code generation and the ability to exploit it hold the
keys to success. I would like to thank you for your comments on
my previous e-mail. I just thought that it would be better to share
my thought that embedded software has for long given up on systematic
software vision in order to get performance. My opinion is that
this need not be the case, and I am glad to see that you share
my opinion as well.
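
The two testing flows mentioned above could be sketched roughly as follows. Everything here is hypothetical scaffolding for illustration: affine_ref stands in for the "natural C" version, affine_opt for whatever optimized version is under test, and the small LCG is only there to make the random vectors reproducible from a seed.

```c
#define VEC_LEN 64

/* Reference version: y[i] = 3*x[i] + 1. */
void affine_ref(const int *x, int *y, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = 3 * x[i] + 1;
}

/* Version under test: same arithmetic, manually unrolled by two. */
void affine_opt(const int *x, int *y, int n)
{
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        y[i]     = 3 * x[i] + 1;
        y[i + 1] = 3 * x[i + 1] + 1;
    }
    if (i < n)                      /* leftover element for odd n */
        y[i] = 3 * x[i] + 1;
}

/* Small LCG so the "random" vectors can be regenerated from a seed. */
static unsigned int lcg_next(unsigned int *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Bit-exact comparison: number of outputs that differ (n <= VEC_LEN). */
int count_mismatches(const int *x, int n)
{
    int ref[VEC_LEN], opt[VEC_LEN];
    int i, bad = 0;

    affine_ref(x, ref, n);
    affine_opt(x, opt, n);
    for (i = 0; i < n; i++)
        if (ref[i] != opt[i])
            bad++;
    return bad;
}

/* Flow 1: a known deterministic vector (a ramp around zero). */
int test_deterministic(void)
{
    int x[VEC_LEN], i;

    for (i = 0; i < VEC_LEN; i++)
        x[i] = i - VEC_LEN / 2;
    return count_mismatches(x, VEC_LEN);
}

/* Flow 2: pseudo-random vectors in the 16-bit range, odd and even lengths. */
int test_random(unsigned int seed, int runs)
{
    int x[VEC_LEN], i, r, bad = 0;

    for (r = 0; r < runs; r++) {
        for (i = 0; i < VEC_LEN; i++)
            x[i] = (int)(lcg_next(&seed) >> 16) - 32768;
        bad += count_mismatches(x, VEC_LEN - (r % 2));
    }
    return bad;
}
```

Both flows return a mismatch count, so the same harness can be rerun unchanged after every change of compiler flags or source rewrite.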

Regards
Jagadeesh Sankaran




Re: C-Coding - akalya - Nov 15 17:40:00 2002


Jagadeesh and Jeff,

Thanks for the posts - that was a very methodical approach to the
process, Jagadeesh; I'm saving it for future use. I have a couple more
questions that I thought I should throw in while the iron is hot:

1. I particularly have a problem when there is heavy register usage,
including heavy conditional register usage - when it is tough to make
the compiler understand the way we want to do certain things. For
instance, when there is a long-latency instruction writing to a
register, I would like to use that register in the intervening
cycles - e.g. an LDx src,dst instruction writes to dst, but the actual write
is going to happen only 4 cycles from there - so I can use it for
other purposes such as buffering in the meantime. But the compiler
never seems to do this even if we write linear asm - it continues to
complain 'register live too long' and increases ii. Is there some
flag that I'm missing?

2. The other thing about mastering compiler friendly C-coding is
that - isn't this vendor specific? I have not worked with many DSPs,
so I don't know! This is one reason why I don't mind the slightly
longer time taken for hand-coded assembly rather than delving deeper into
compiler eccentricities - IMO, the learning time for both is nearly
the same, with better returns from learning hand-coded assembly - I
have seen that there are heavily methodical approaches to hand-coded
assembly also. My question, hence, is whether compilers across vendors
have similar (if not the same) rules for producing optimized VLIW code.

Looking forward to your comments,

TIA
ka

Re: C-Coding - Jeff Brower - Nov 15 18:58:00 2002

Jagadeesh Sankaran-

> Hardware issues can be tough. This is why having something on which you can
> instrument is a better option.

By "instrument" do you mean modifying the source? If so, that's unacceptable in
our work. We cannot debug or test performance-sensitive, real-time source code
with any insertions into the code. Debugging must be by standard JTAG or HPI.

Jeff Brower
DSP sw/hw engineer
Signalogic


Re: Re: C-Coding - Jagadeesh Sankaran - Nov 15 19:26:00 2002


>
>1. I particularly have a problem when there is heavy register usage,
>including heavy conditional register usage - when it is tough to make
>the compiler understand the way we want to do certain things. For
>instance, when there is a long-latency instruction writing to a
>register, I would like to use that register in the intervening
>cycles - e.g. an LDx src,dst instruction writes to dst, but the actual write
>is going to happen only 4 cycles from there - so I can use it for
>other purposes such as buffering in the meantime. But the compiler
>never seems to do this even if we write linear asm - it continues to
>complain 'register live too long' and increases ii. Is there some
>flag that I'm missing?

Live too longs are interesting problems in VLIW scheduling. A live too
long can arise in two different scenarios:

a) When the value of the variable for the next iteration has
already been computed but the variable cannot be updated,
as its current value on this iteration is being used after
a long time.

b) When the value of a variable in any iteration is set through
one of multiple paths, of different length.

A simple example of (a) is to read a location, update the value, and
write it back using the same pointer. This prevents successive
values from being loaded because of pointer dependencies. A
simple way to break live too longs is register renaming:
copy the variable into a temporary and use the temporary copy
going forward. The compiler does this most of the time, both in C code
and in serial assembly. When it cannot automatically figure it out,
some manual intervention in doing this helps.

Always load values speculatively and use the loaded values if needed,
rather than predicating the load itself. This does require one to
take care that one is not overstepping the array.

eg:

if (i >= 32) i = (j*3);
else i = table[i];

This code can be re-written to read safely within the array
of 32 possible values as follows (assuming i is non-negative):

index_sp = (i & 31); // keep index modulo 32
value_sp = table[index_sp];
in_range = (i < 32); // test the original i before it is overwritten
i = (j*3);
if (in_range) i = value_sp;

This allows the conditional load to be issued way early, so that
the latency of the load pipeline can be hidden.
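
A self-contained sketch of the speculative-load idea, written so the two forms can be compared directly. The function names and TABLE_SIZE are made up for this example, and it assumes a 32-entry table indexed 0..31 and a non-negative i; the range test is done on the original i, and the mask is TABLE_SIZE - 1 so the speculative read always stays inside the table.

```c
#define TABLE_SIZE 32   /* must be a power of two for the mask below */

/* Original form: the load itself is conditional. */
int lookup_branchy(const int *table, int i, int j)
{
    if (i >= TABLE_SIZE)
        return j * 3;
    else
        return table[i];
}

/* Speculative form: always load from a masked (safe) index, then select. */
int lookup_speculative(const int *table, int i, int j)
{
    int index_sp = i & (TABLE_SIZE - 1);   /* always within 0..31 */
    int value_sp = table[index_sp];        /* unconditional load */
    int result   = j * 3;

    if (i < TABLE_SIZE)                    /* decided on the original i */
        result = value_sp;
    return result;
}
```

The load no longer depends on the comparison, so a scheduler is free to issue it early; only the final select is conditional.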

Using the -mx flag will automatically try to avoid live too
longs. Try using fresh names for temporary variables, as opposed
to using the same name. This avoids creating fictitious name
dependencies. When the compiler complains about live too long,
take a look at the scheduled code and you will see which source
lines the compiler is not able to resolve by noting the (^) marked
against them in the pipeline listing. Try to see if you can
re-write them. Avoid writing hand code; it is a fixed lump of code
that is a significant drain on one's time and resources. Besides,
you can try 100 different things in the same time that you develop
hand code.

On the C62x there are only 32 registers, so register pressure
is definitely possible. On C64x with 64 registers, the possibility
is remote. While coding on C62x try to keep the loop unroll
amounts reasonable. On highly conditional code,
try to use the multiply as well to issue some of the conditional
operations. For example,

if ( i > 0) j += 32;

can be written as:

flag = ( i > 0);
j = j + (flag * 32);

This code sequence just saved the use of a conditional register.
There are many such tricks; the more you try, the more you will
develop your own.

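
The multiply trick above can be checked with a small side-by-side sketch (function names invented for this example). It relies only on the C rule that a relational expression evaluates to 0 or 1.

```c
/* Branching form: the update is predicated on (i > 0). */
int bump_branchy(int i, int j)
{
    if (i > 0)
        j += 32;
    return j;
}

/* Arithmetic form: (i > 0) is 0 or 1, so the product adds either 0 or 32
 * without a conditional operation. */
int bump_multiply(int i, int j)
{
    int flag = (i > 0);
    return j + (flag * 32);
}
```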
>2. The other thing about mastering compiler friendly C-coding is
>that - isn't this vendor specific? I have not worked with many DSPs,
>so I don't know! This is one reason why I don't mind the slightly
>longer time taken for hand-coded assembly rather than delving deeper into
>compiler eccentricities - IMO, the learning time for both is nearly
>the same, with better returns from learning hand-coded assembly - I
>have seen that there are heavily methodical approaches to hand-coded
>assembly also. My question, hence, is whether compilers across vendors
>have similar (if not the same) rules for producing optimized VLIW code.

I would say it is more architecture specific. Fortunately most DSP
processors today are VLIW devices, so there is a lot of commonality.
Hand coded assembly is prone to errors and maintenance difficulties.
On the C6x you have to take care to turn off interrupts when you
are not in single register assignment mode. You have to take
care in maintaining the C calling convention. You have to
take care that you are doing stack maintenance correctly. It is
so prone to error that it is best avoided. While the individual
flags are different from vendor to vendor, similarities do exist
in what is compiler friendly and what is not.

I am not trying to deprecate the use of hand optimized coding. I have
written hand code on several occasions; I just feel that this is not
the right path to developing reliable and safe software. It only gets
harder writing hand code going from C62x to C64x, with SIMD and VLIW
together. The compiler and assembly optimizer in my opinion are your
best bets.

Regards
Jagadeesh Sankaran



Re: Re: C-Coding - Jeff Brower - Nov 16 0:14:00 2002

Akalya-

> 2. The other thing about mastering compiler friendly C-coding is
> that - isn't this vendor specific?

Of course. But whose other 32-bit DSP are you going to use? Maybe SHARC...
Not Starcore. Does Motorola even have one? Unforeseen factors, fate, even some
luck, and no small amount of hard work and dedication by TI have left them the
only DSP vendor still standing who offers the whole package -- wide range of fxp
and flp solutions, detailed roadmap, extremely dense, low-power, and small
packages. And of course, excellent development tools.

Jeff Brower
DSP sw/hw engineer
Signalogic


Re: C-Coding - Bhooshan iyer - Nov 21 10:52:00 2002





Mr. Jagadeesh and Mr. Jeff Brower,

If I had a little more time and patience I perhaps would have collected this entire thread on optimizations, edited it in a more logical way and maybe made it available as a PDF in the public domain!!! (And I still might...)

I have to say, it was a vastly enlightening discussion... though it was something that I couldn't read and digest in one day... because maybe some of the things described, I may not have actually experienced at all...

Great discussion, thank you guys!!! (bhooshan bows!)

I have a question here. I was going through an optimization recommendation made by an Israeli professor on C code, which seems buggy/incorrect to me, but it is a little hard to believe that a man trying to teach optimization could make such silly mistakes, so I want your help in understanding whether he is right (if he is, then I don't understand what he is doing...). I am posting the slide below and then my question...

How to write better C code for DSPs

Use Simple Loops:

not so good:

--------------------------------------------------------------------

for(k=N;k>0;k--)
{
    res[2*k]=x[2*k]*y[2*k]+x[2*k+1]*y[2*k+1];
    res[2*k+1]=x[2*k+1]*y[2*k]-x[2*k]*y[2*k+1];
}

--------------------------------------------------------------------

better:

for(k=N;k>0;k--)
    res[2*k]=x[2*k]+y[2*k];

for(k=N;k>0;k--)
    res[2*k]+=x[2*k+1]+y[2*k+1];

for(k=N;k>0;k--)
    res[2*k-1]=x[2*k+1]+y[2*k];

for(k=N;k>0;k--)
    res[2*k-1]-=x[2*k]*y[2*k+1];

--------------------------------------------------------------------

The second part of the code does not match the first in correctness, and I can't figure out for the life of me how it could be the same! As far as the optimization is concerned, maybe someone could throw light on why the second method is better optimized than the first one. Jeff, Jagadeesh? (Is it FFT code, by the way? Sum and difference equations?)

And for the benefit of the many other freshers, I am attaching to this mail the .ppt made by Prof. Assaf Kasher from Technion (I think!), which talks about DSP optimization. Enjoy!

 

Regards,

         Bhooshan




Attachment (not stored)
lecture4nn.ppt
Type: application/vnd.ms-powerpoint



RE: C-Coding - William Zimmerman - Nov 21 14:55:00 2002

Bhooshan
 
As far as the correctness of the code, the operations on the right sides of the equations should all be multiplies, not adds.  It was probably just a transcription error. 
 
Bill
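
To make that concrete, here is a sketch in plain C that puts the single loop from the slide next to a split version with multiplies on every right-hand side. Two things are assumptions rather than anything the slide or the reply above states: the third and fourth loops write to `res[2*k+1]` (the slide shows `res[2*k-1]`), and `int` is used as the element type. The function names are made up for the example.

```c
#include <assert.h>

/* Single loop, as on the "not so good" half of the slide. Indices run
 * from 2 to 2*N+1, so each array needs at least 2*N+2 elements. */
void cmul_fused(const int *x, const int *y, int *res, int N)
{
    int k;

    assert(N >= 0);
    for (k = N; k > 0; k--) {
        res[2*k]   = x[2*k]*y[2*k]   + x[2*k+1]*y[2*k+1];
        res[2*k+1] = x[2*k+1]*y[2*k] - x[2*k]*y[2*k+1];
    }
}

/* Split version: one multiply (and at most one accumulate) per loop.
 * Assumes the odd outputs were meant to go to res[2*k+1]. */
void cmul_split(const int *x, const int *y, int *res, int N)
{
    int k;

    assert(N >= 0);
    for (k = N; k > 0; k--)
        res[2*k] = x[2*k]*y[2*k];
    for (k = N; k > 0; k--)
        res[2*k] += x[2*k+1]*y[2*k+1];
    for (k = N; k > 0; k--)
        res[2*k+1] = x[2*k+1]*y[2*k];
    for (k = N; k > 0; k--)
        res[2*k+1] -= x[2*k]*y[2*k+1];
}
```

With those two assumptions the versions produce identical results. Whether the split form is actually faster depends on the compiler and target, so it is worth profiling both.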

RE: C-Coding - Bhooshan iyer - Nov 22 6:09:00 2002


Thank you Mr. Brower and Mr. Zimmerman for the answers. And yes, Mr. Zimmerman, using * instead of + seems to do the trick; it looks like a rather silly mistake!

But as Mr. Brower mentioned, there seems to be a loss of clarity, at least to me (otherwise I wouldn't have even posted the message!). So I'll try Mr. Brower's suggestions and profile the code to see how good the results are. I'll keep the group posted on that.

Thank you guys for replying

Regards,

           Bhooshan


 


RE: C-Coding - Jagadeesh Sankaran - Nov 22 16:57:00 2002

This is an interesting and frequently occurring snippet of code. I thought
I would take the time to be a little more concrete in showing off
the power of the compiler tools, so I have followed the approach I
outlined earlier. I have been extremely busy, and while I think the
code is correct, I have not necessarily had a chance to validate
it.

void foo_cn(const short *restrict x, int N, const short *restrict y,
short *restrict res)
{
int k;

for(k = N;k > 0;k--)
{
res[2*k] = x[2*k]*y[2*k] + x[2*k+1] * y[2*k+1];
res[2*k+1] = x[2*k+1]*y[2*k] - x[2*k] * y[2*k+1];
}
}

Here is the code that you provided. I took the liberty of adding
a few consts and restricts to help the compiler. Nothing spectacular,
just simple information to guide it toward better code generation. I used
version 4.31 of the tools and typed:

cl6x -k -o2 -mwtx -mv6200 opt.c

I am now reproducing the piped loop kernel (only) that the compiler
gave:

;** --------------------------------------------------------------------------*
L2: ; PIPED LOOP KERNEL

MPY .M2X B7,A5,B8 ; |8| <1,6> ^
|| MPY .M1X B7,A6,A6 ; |7| <1,6> ^
|| [ B0] B .S1 L2 ; |9| <1,6>
|| [ A1] LDH .D2T2 *+B4(2),B9 ; |7| <3,0> ^
|| [ A1] LDH .D1T1 *+A0(2),A5 ; |7| <3,0> ^

[ A2] MPYSU .M1 2,A2,A2 ; <0,10>
|| SUB .S2 B6,B8,B6 ; |8| <0,10> ^
|| MPY .M2X B9,A6,B6 ; |8| <1,7> ^
|| [ A1] LDH .D1T1 *A0--(4),A6 ; |7| <3,1> ^
|| [ A1] LDH .D2T2 *B4--(4),B7 ; |7| <3,1> ^

[ A1] SUB .S1 A1,1,A1 ; <0,11>
|| [!A2] STH .D1T1 A7,*--A4(4) ; |7| <0,11> ^
|| [!A2] STH .D2T2 B6,*--B5(4) ; |8| <0,11> ^
|| ADD .L1 A3,A6,A7 ; |7| <1,8> ^
|| [ B0] SUB .S2 B0,1,B0 ; |9| <2,5> ^
|| MPY .M1X B9,A5,A3 ; |7| <2,5> ^

;** --------------------------------------------------------------------------*

One can see that one iteration of the loop in steady state takes 3
cycles. The compiler guards the loads and stores with predicates,
implementing advanced loop collapsing techniques that reduce the code
size of the prolog and epilog. For example, in this code it collapsed
3 epilog stages and 2 prolog stages. This code performs 4 multiplies
in 3 cycles. With two multipliers (.M units), the theoretical minimum
cycle count is 2 cycles. Hence we get 66% efficient code. Not bad for
a few minutes' worth of effort.
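
The efficiency figure is just the multiplier-bound cycle count divided by the achieved cycle count. A small sketch of that arithmetic in C, assuming two .M units that each start one multiply per cycle; the helper names are hypothetical.

```c
#include <assert.h>

/* Fewest cycles per source iteration that the multipliers alone allow:
 * ceil(multiplies / m_units). */
int mult_bound_cycles(int multiplies, int m_units)
{
    assert(multiplies >= 0 && m_units > 0);
    return (multiplies + m_units - 1) / m_units;
}

/* Efficiency in percent (rounded down) of a kernel that needs
 * kernel_cycles cycles to complete iters_per_kernel source iterations. */
int efficiency_percent(int multiplies, int m_units,
                       int kernel_cycles, int iters_per_kernel)
{
    int bound = mult_bound_cycles(multiplies, m_units);

    assert(kernel_cycles > 0 && iters_per_kernel > 0);
    return (100 * bound * iters_per_kernel) / kernel_cycles;
}
```

For this loop, `efficiency_percent(4, 2, 3, 1)` evaluates to 66: four multiplies, two units, a 3-cycle kernel, one source iteration per kernel pass.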

Then I added some information about the alignment of the arrays
(word-aligned), the even number of iterations, and the trip count, and
requested the compiler to perform automatic inner loop unrolling to see
if I got a better schedule. Here are the C code changes required to do this:

void foo_co(const short *restrict x, int N, const short *restrict y, short
*restrict res)
{
int k;

_nassert((int)N%2 == 0);
_nassert((int)x%4 == 0);
_nassert(N >= 64);
_nassert((int)y%4 == 0);
_nassert((int)res%4 == 0);
#pragma UNROLL (2);
for(k = N;k > 0;k--)
{
res[2*k] = x[2*k]*y[2*k] + x[2*k+1] * y[2*k+1];
res[2*k+1] = x[2*k+1]*y[2*k] - x[2*k] * y[2*k+1];
}
}
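
`_nassert` and `#pragma UNROLL` are specific to the TI compiler. For trying the same loop on a host compiler, a sketch along these lines may help. It maps `_nassert` to a plain run-time `assert` (an assumption for host testing only; on the TI tools `_nassert` generates no code and only informs the optimizer) and unrolls the loop by hand to show roughly what the pragma asks for. `HOST_NASSERT` and `foo_unrolled` are names invented for this example, and the `N >= 64` hint is left out so that small test sizes work.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for TI's _nassert: check the promise at run time. */
#define HOST_NASSERT(cond) assert(cond)

void foo_unrolled(const short *restrict x, int N,
                  const short *restrict y, short *restrict res)
{
    int k;

    HOST_NASSERT(N % 2 == 0);               /* even trip count */
    HOST_NASSERT((uintptr_t)x % 4 == 0);    /* word-aligned arrays */
    HOST_NASSERT((uintptr_t)y % 4 == 0);
    HOST_NASSERT((uintptr_t)res % 4 == 0);

    /* Two source iterations (k and k-1) per pass of the loop. */
    for (k = N; k > 0; k -= 2) {
        res[2*k]   = (short)(x[2*k]*y[2*k]     + x[2*k+1]*y[2*k+1]);
        res[2*k+1] = (short)(x[2*k+1]*y[2*k]   - x[2*k]*y[2*k+1]);
        res[2*k-2] = (short)(x[2*k-2]*y[2*k-2] + x[2*k-1]*y[2*k-1]);
        res[2*k-1] = (short)(x[2*k-1]*y[2*k-2] - x[2*k-2]*y[2*k-1]);
    }
}
```

Each pass now touches two adjacent pairs of shorts, which is what lets a compiler that knows the arrays are word-aligned fetch a pair with a single 32-bit load.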

Once again I am reproducing the piped loop kernel only for this code.

;** --------------------------------------------------------------------------*
L5: ; PIPED LOOP KERNEL

ADD .L1 A5,A0,A5 ; |25| <0,10>
|| MPYHL .M2X B6,A3,B9 ; |26| <1,5>
|| MV .S1X B6,A8 ; |25| <1,5> Define a twin register
|| LDW .D1T1 *+A4(4),A3 ; |25| <2,0>
|| LDW .D2T2 *+B5(4),B6 ; |25| <2,0>

MV .S2X A5,B3 ; |25| <0,11> Define a twin reg.
|| MPYHL .M1X B7,A6,A5 ; |26| <1,6>
|| LDW .D1T1 *A4--(8),A6 ; |25| <2,1>
|| LDW .D2T2 *B5--(8),B7 ; |25| <2,1>

ADD .S1X B2,A0,A0 ; |25| <0,12>
|| [!A1] STH .D2T2 B3,*B4--(8) ; |25| <0,12>
|| MPYHL .M2X A3,B6,B6 ; |26| <1,7>
|| MPY .M1 A8,A3,A0 ; |25| <1,7>

SUB .L1X A9,B8,A3 ; |26| <0,13>
|| [!A1] STH .D1T1 A0,*A7--(8) ; |25| <0,13>
|| [!A1] STH .D2T2 B1,*+B4(10) ; |26| <0,13>
|| MV .S1 A5,A9 ; |26| <1,8> Split a long life
|| [ B0] SUB .S2 B0,1,B0 ; |27| <1,8>
|| MPYH .M1 A8,A3,A5 ; |25| <1,8>
|| MPYH .M2X B7,A6,B2 ; |25| <1,8>

[ A1] SUB .S1 A1,1,A1 ; <0,14>
|| [!A1] STH .D1T1 A3,*+A7(10) ; |26| <0,14>
|| MPYHL .M2X A6,B7,B8 ; |26| <1,9>
|| SUB .D2 B9,B6,B1 ; |26| <1,9>
|| [ B0] B .S2 L5 ; |27| <1,9>
|| MPY .M1X B7,A6,A0 ; |25| <1,9>

;** --------------------------------------------------------------------------*

The compiler found a 5 cycle loop for 2 iterations. Hence it performs 1
iteration of the loop in 2.5 cycles. As I said earlier the theoretical best
for this loop is 2 cycles based on M unit utilization. This code is 80%
efficient to the theoretical best throughput you can expect. Some time
ago I replied about live too longs, see how the compiler is automatically
making copies to avoid live too longs. Not bad for another 3- minutes of
effort.

I then started writing intrinsic C code, this gave only a 7-cycle
loop for 2 iters. as shown below, but still was useful for me to
develop Serial assembly code. I am only going to show the C code
so that one can easily follow the SA code.

void foo_c(const short *restrict x, int N, const short *restrict y, short
*restrict res)
{
int k;
int xword0;
int yword0;
int xword1;
int yword1;

int xt0;
int yt0;
int oword0;
int xt1;
int yt1;
int oword1;

for(k = 0; k > (N >> 1);k+= 2)
{

xword0 = _amem4_const(&x[0]);
yword0 = _amem4_const(&y[0]);
xword1 = _amem4_const(&x[2]);
yword1 = _amem4_const(&y[2]);

x += 4;
y += 4;

/*---------------------------------------------------------*/
/* res[2*k] = x[2*k]*y[2*k] + x[2*k+1] * y[2*k+1]; */
/* res[2*k+1] = x[2*k+1]*y[2*k] - x[2*k] * y[2*k+1]; */
/*---------------------------------------------------------*/

xt0 = _mpy(xword0, yword0) + _mpyh(xword0, yword0);
yt0 = _mpyhl(xword0, yword0) - _mpylh(xword0, yword0);
xt1 = _mpy(xword1, yword1) + _mpyh(xword1, yword1);
yt1 = _mpyhl(xword1, yword1) - _mpylh(xword1, yword1);

oword0 = (yt0 << 16) + (xt0 & 0xFFFF);
oword1 = (yt1 << 16) + (xt1 & 0xFFFF);
_amem4(&res[0]) = oword0;
_amem4(&res[2]) = oword1;
res += 4;

}
}
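
For readers without the TI tools at hand, the packed-multiply intrinsics can be emulated in portable C to check the arithmetic for one input word pair. This is a sketch under assumptions: `emu_mpy`, `emu_mpyh`, `emu_mpyhl`, `emu_mpylh`, `pack2` and `cmul_packed` are hypothetical stand-ins, and the packing assumes little-endian data (even-indexed element in the low half-word), as in the compiler listing.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a 16-bit pattern (0..0xFFFF) as a signed value. */
static int s16(uint32_t v)
{
    return v >= 0x8000u ? (int)v - 0x10000 : (int)v;
}

/* Signed low and high 16-bit halves of a 32-bit word. */
static int lo16(uint32_t w) { return s16(w & 0xFFFFu); }
static int hi16(uint32_t w) { return s16(w >> 16); }

/* Emulations of the four 16x16 multiplies used in the loop body. */
int emu_mpy(uint32_t a, uint32_t b)   { return lo16(a) * lo16(b); }
int emu_mpyh(uint32_t a, uint32_t b)  { return hi16(a) * hi16(b); }
int emu_mpyhl(uint32_t a, uint32_t b) { return hi16(a) * lo16(b); }
int emu_mpylh(uint32_t a, uint32_t b) { return lo16(a) * hi16(b); }

/* Pack two 16-bit values into one word: lo in bits 0..15, hi in 16..31. */
uint32_t pack2(int hi, int lo)
{
    return ((uint32_t)hi << 16) | ((uint32_t)lo & 0xFFFFu);
}

/* One packed x/y pair in, one packed result out:
 * low half  = x[2k]*y[2k]   + x[2k+1]*y[2k+1]
 * high half = x[2k+1]*y[2k] - x[2k]*y[2k+1]   (both truncated to 16 bits) */
uint32_t cmul_packed(uint32_t xword, uint32_t yword)
{
    int xt = emu_mpy(xword, yword)   + emu_mpyh(xword, yword);
    int yt = emu_mpyhl(xword, yword) - emu_mpylh(xword, yword);

    return pack2(yt, xt);
}
```

When wrapping this in a loop, note that the condition `k > (N >> 1)` as posted starts from k = 0 and is therefore false on entry for any positive N; `k < (N >> 1)` was presumably intended.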

Here is the SA code:

.global _foo_sa
foo_sa: .cproc A_x, B_n, A_y, B_res

.reg A_xp, B_xp, A_yp, B_yp
.reg A_rp, B_rp, A_xw0, B_xw1
.reg A_yw0, B_yw1, A_rw0, B_rw1

.reg A_xt00, A_xt01, A_yt00, A_yt01
.reg B_xt10, B_xt11, B_yt10, B_yt11
.reg A_xt0, A_yt0, B_xt1, B_yt1

MV A_x, A_xp
ADD A_x, 4, B_xp

MV A_y, A_yp
ADD A_y, 4, B_yp

MV B_res, A_rp
ADD B_res, 2, B_rp

LOOP:
LDW.D1T1 *A_xp++[2], A_xw0
LDW.D2T2 *B_xp++[2], B_xw1
LDW.D1T1 *A_yp++[2], A_yw0
LDW.D2T2 *B_yp++[2], B_yw1

MPY.1 A_xw0, A_yw0, A_xt00
MPYH.1 A_xw0, A_yw0, A_xt01
MPYHL.1 A_xw0, A_yw0, A_yt00
MPYLH.1 A_xw0, A_yw0, A_yt01

MPY.2 B_xw1, B_yw1, B_xt10
MPYH.2 B_xw1, B_yw1, B_xt11
MPYHL.2 B_xw1, B_yw1, B_yt10
MPYLH.2 B_xw1, B_yw1, B_yt11

ADD.1 A_xt00, A_xt01, A_xt0
ADD.1 A_yt00, A_yt01, A_yt0
ADD.2 B_xt10, B_xt11, B_xt1
ADD.2 B_yt10, B_yt11, B_yt1

STH.D1T1 A_xt0, *A_rp++[2]
STH.D2T1 A_yt0, *B_rp++[2]
STH.D1T2 B_xt1, *A_rp++[2]
STH.D2T2 B_yt1, *B_rp++[2]

[B_n] SUB.2 B_n, 1, B_n
[B_n] B.2 LOOP

.return
.endproc

And finally, here is the piece of diamond we have been hunting for. Isn't it
a joy to get automatic code gen to find this, without writing painful
hand code (Ik!).

;** --------------------------------------------------------------------------*
LOOP: ; PIPED LOOP KERNEL

MPYLH .M2 B9,B12,B13 ; |76| <0,12>
|| MV .S2 B8,B12 ; |66| <1,8> Split a long life
|| MV .L2 B11,B9 ; |64| <1,8> Split a long life
|| MPYLH .M1 A6,A11,A7 ; |71| <1,8>
|| [ A1] LDW .D2T2 *B5++(8),B11 ; |64| <3,0>
|| [ A1] LDW .D1T1 *A10++(8),A4 ; |65| <3,0>

ADD .L1 A5,A7,A4 ; |79| <0,13>
|| [ B0] ADD .S2 0xffffffff,B0,B0 ; |88| <1,9>
|| MPYHL .M1 A6,A11,A5 ; |70| <1,9>
|| MPYHL .M2 B9,B12,B2 ; |75| <1,9>
|| MV .S1 A4,A11 ; |65| <2,5> Split a long life
|| [ A1] LDW .D1T1 *A9++(8),A6 ; |63| <3,1>
|| [ A1] LDW .D2T2 *B6++(8),B8 ; |66| <3,1>

[!A2] STH .D1T2 B7,*-A8(4) ; |85| <0,14>
|| [!A2] STH .D2T1 A4,*B4++(8) ; |84| <0,14>
|| ADD .L2 B2,B13,B7 ; |81| <0,14>
|| ADD .S1 A0,A3,A0 ; |78| <1,10>
|| [ B0] B .S2 LOOP ; |89| <1,10>
|| MPY .M2 B11,B8,B10 ; |73| <2,6>
|| MPYH .M1 A6,A11,A3 ; |69| <2,6>

[ B1] SUB .S2 B1,1,B1 ; <0,15>
|| [ A2] SUB .S1 A2,1,A2 ; <0,15>
|| [ A1] SUB .L1 A1,1,A1 ; <0,15>
|| [!A2] STH .D2T2 B7,*-B4(4) ; |86| <0,15>
|| ADD .L2 B10,B3,B7 ; |80| <1,11>
|| [!B1] STH .D1T1 A0,*A8++(8) ; |83| <1,11>
|| MPYH .M2 B11,B8,B3 ; |74| <2,7>
|| MPY .M1 A6,A11,A0 ; |68| <2,7>

Once again the compiler has collapsed 2 epilog and 2 prolog
stages. It also has 100% multiplier utilization and is hence
100% optimal. I am attaching all code samples for use and
verification by others. Now ain't she beautiful.

CAUTION:
This loop is simple, not all loops will end up with 100% optimal
code, but still it is worth a try.

Regards
Jagadeesh Sankaran



Attachment (not stored)
opt.c
Type: TEXT/x-sun-c-file

;******************************************************************************
;* TMS320C6x C/C++ Codegen Unix Version 4.31 *
;* Date/Time created: Fri Nov 22 10:45:37 2002 *
;******************************************************************************

;******************************************************************************
;* GLOBAL FILE PARAMETERS *
;* *
;* Architecture : TMS320C620x *
;* Optimization : Enabled at level 2 *
;* Optimizing for : Speed *
;* Based on options: -o2, no -ms *
;* Endian : Little *
;* Interrupt Thrshld : Disabled *
;* Memory Model : Small *
;* Calls to RTS : Near *
;* Pipelining : Enabled *
;* Speculative Load : Disabled *
;* Memory Aliases : Presume not aliases (optimistic) *
;* Debug Info : No Debug Info *
;* *
;******************************************************************************
.asg A15, FP
.asg B14, DP
.asg B15, SP
.global $bss

; opt6x -t -v6200 -O2 /var/tmp/aaaa004RJ /var/tmp/daaa004RJ

.sect ".text"
.global _foo_cn

;******************************************************************************
;* FUNCTION NAME: _foo_cn *
;* *
;* Regs Modified : A0,A1,A2,A3,A4,A5,A6,A7,B0,B4,B5,B6,B7,B8,B9 *
;* Regs Used : A0,A1,A2,A3,A4,A5,A6,A7,B0,B3,B4,B5,B6,B7,B8,B9 *
;* Local Frame Size : 0 Args + 0 Auto + 0 Save = 0 byte *
;******************************************************************************
_foo_cn:
;** --------------------------------------------------------------------------*
CMPGT .L2 B4,0,B0 ; |5|
[!B0] RET .S2 B3 ; |10|
ADD .D2 B4,B4,B5
MV .S1X B5,A0 ; Define a twin register
ADDAH .D2 B6,B5,B6
MV .S2X A4,B7 ; |2|

ADD .S2 4,B6,B5
|| ADDAH .D2 B7,B5,B4
|| ADDAH .D1 A6,A0,A0
|| MV .S1X B4,A5 ; |2|

; BRANCH OCCURS ; |10| ; chained return
;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 5
;* Loop opening brace source line : 6
;* Loop closing brace source line : 9
;* Known Minimum Trip Count : 1
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 1
;* Unpartitioned Resource Bound : 3
;* Partitioned Resource Bound(*) : 3
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 1 0
;* .D units 3* 3*
;* .M units 2 2
;* .X cross paths 2 2
;* .T address paths 3* 3*
;* Long read paths 1 1
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 1 2 (.L or .S or .D unit)
;* Bound(.L .S .LS) 1 0
;* Bound(.L .S .D .LS .LSD) 2 2
;*
;* Searching for software pipeline schedule at ...
;* ii = 3 Schedule found with 4 iterations in parallel
;*
;* Register Usage Table:
;* +---------------------------------+
;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
;* |0000000000111111|0000000000111111|
;* |0123456789012345|0123456789012345|
;* |----------------+----------------|
;* 0: |*** **** |* ****** |
;* 1: |***** ** |* *** ** |
;* 2: |******** |* *** ** |
;* +---------------------------------+
;*
;* Done
;*
;* Collapsed epilog stages : 3
;* Prolog not entirely removed
;* Collapsed prolog stages : 2
;*
;* Minimum required memory pad : 0 bytes
;*
;* For further improvement on this loop, try option -mh12
;*
;* Minimum safe trip count : 1
;*----------------------------------------------------------------------------*
;* SETUP CODE
;*
;* MV A4,B5
;* ADD 2,B5,B5
;*
;* SINGLE SCHEDULED ITERATION
;*
;* C30:
;* 0 LDH .D1T1 *+A0(2),A5 ; |7| ^
;* || LDH .D2T2 *+B4(2),B9 ; |7| ^
;* 1 LDH .D1T1 *A0--(4),A6 ; |7| ^
;* || LDH .D2T2 *B4--(4),B7 ; |7| ^
;* 2 NOP 3
;* 5 MPY .M1X B9,A5,A3 ; |7| ^
;* || [ B0] SUB .S2 B0,1,B0 ; |9| ^
;* 6 MPY .M1X B7,A6,A6 ; |7| ^
;* || MPY .M2X B7,A5,B8 ; |8| ^
;* || [ B0] B .S1 C30 ; |9|
;* 7 MPY .M2X B9,A6,B6 ; |8| ^
;* 8 ADD .L1 A3,A6,A7 ; |7| ^
;* 9 NOP 1
;* 10 SUB .S2 B6,B8,B6 ; |8| ^
;* 11 STH .D1T1 A7,*--A4(4) ; |7| ^
;* || STH .D2T2 B6,*--B5(4) ; |8| ^
;* ; BRANCH OCCURS ; |9|
;*----------------------------------------------------------------------------*
L1: ; PIPED LOOP PROLOG

SUB .L1 A5,1,A1
|| MV .S2X A5,B0
|| LDH .D2T2 *+B4(2),B9 ; |7| (P) <0,0> ^
|| B .S1 L2 ; |9| (P) <0,6>
|| LDH .D1T1 *+A0(2),A5 ; |7| (P) <0,0> ^

ZERO .S1 A3
|| LDH .D1T1 *A0--(4),A6 ; |7| (P) <0,1> ^
|| LDH .D2T2 *B4--(4),B7 ; |7| (P) <0,1> ^

MV .L1X B5,A4
|| ADD .D2 2,B5,B5
|| SET .S1 A3,0xf,0xf,A2 ; init prolog collapse predicate

;** --------------------------------------------------------------------------*
L2: ; PIPED LOOP KERNEL

MPY .M2X B7,A5,B8 ; |8| <1,6> ^
|| MPY .M1X B7,A6,A6 ; |7| <1,6> ^
|| [ B0] B .S1 L2 ; |9| <1,6>
|| [ A1] LDH .D2T2 *+B4(2),B9 ; |7| <3,0> ^
|| [ A1] LDH .D1T1 *+A0(2),A5 ; |7| <3,0> ^

[ A2] MPYSU .M1 2,A2,A2 ; <0,10>
|| SUB .S2 B6,B8,B6 ; |8| <0,10> ^
|| MPY .M2X B9,A6,B6 ; |8| <1,7> ^
|| [ A1] LDH .D1T1 *A0--(4),A6 ; |7| <3,1> ^
|| [ A1] LDH .D2T2 *B4--(4),B7 ; |7| <3,1> ^

[ A1] SUB .S1 A1,1,A1 ; <0,11>
|| [!A2] STH .D1T1 A7,*--A4(4) ; |7| <0,11> ^
|| [!A2] STH .D2T2 B6,*--B5(4) ; |8| <0,11> ^
|| ADD .L1 A3,A6,A7 ; |7| <1,8> ^
|| [ B0] SUB .S2 B0,1,B0 ; |9| <2,5> ^
|| MPY .M1X B9,A5,A3 ; |7| <2,5> ^

;** --------------------------------------------------------------------------*
L3: ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
RET .S2 B3 ; |10|
NOP 5
; BRANCH OCCURS ; |10|
.sect ".text"
.global _foo_co

;******************************************************************************
;* FUNCTION NAME: _foo_co *
;* *
;* Regs Modified : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,B0,B1,B2,B3,B4,B5,B6, *
;* B7,B8,B9,B10,SP *
;* Regs Used : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,B0,B1,B2,B3,B4,B5,B6, *
;* B7,B8,B9,B10,DP,SP *
;* Local Frame Size : 0 Args + 0 Auto + 4 Save = 4 byte *
;******************************************************************************
_foo_co:
;** --------------------------------------------------------------------------*
SHL .S2 B4,2,B5

STW .D2T2 B10,*SP--(8) ; |14|
|| MVC .S2 CSR,B10

ADD .S1X B5,A6,A0
|| ADD .D2 B5,B6,B7
|| ADD .S2X B5,A4,B5
|| AND .L2 -2,B10,B6

SUB .D1 A0,4,A4
|| SUB .D2 B5,4,B5
|| SUB .L2 B7,4,B7
|| SHR .S2 B4,1,B4 ; |25|

MVC .S2 B6,CSR ; interrupts off
|| LDW .D2T2 *+B5(4),B6 ; |25| (P) <0,0>
|| LDW .D1T1 *+A4(4),A3 ; |25| (P) <0,0>

;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 23
;* Loop opening brace source line : 24
;* Loop closing brace source line : 27
;* Loop Unroll Multiple : 2x
;* Known Minimum Trip Count : 32
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 0
;* Unpartitioned Resource Bound : 4
;* Partitioned Resource Bound(*) : 5
;* Resource Partition:
;* A-side B-side
;* .L units 1 0
;* .S units 0 1
;* .D units 4 4
;* .M units 4 4
;* .X cross paths 5* 5*
;* .T address paths 4 4
;* Long read paths 2 2
;* Long write paths 0 0
;* Logical ops (.LS) 2 1 (.L or .S unit)
;* Addition ops (.LSD) 1 2 (.L or .S or .D unit)
;* Bound(.L .S .LS) 2 1
;* Bound(.L .S .D .LS .LSD) 3 3
;*
;* Searching for software pipeline schedule at ...
;* ii = 5 Schedule found with 3 iterations in parallel
;*
;* Register Usage Table:
;* +---------------------------------+
;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
;* |0000000000111111|0000000000111111|
;* |0123456789012345|0123456789012345|
;* |----------------+----------------|
;* 0: |** *** * * |*** *** |
;* 1: |** ******* |*** ***** |
;* 2: |** ** **** |********** |
;* 3: |** ******* |** ** *** |
;* 4: |** ** ** * |* **** * |
;* +---------------------------------+
;*
;* Done
;*
;* Epilog not removed
;* Collapsed epilog stages : 0
;*
;* Prolog not entirely removed
;* Collapsed prolog stages : 1
;*
;* Minimum required memory pad : 0 bytes
;*
;* Minimum safe trip count : 2 (after unrolling)
;*----------------------------------------------------------------------------*
;* SETUP CODE
;*
;* MV A7,B4
;* ADD 4,B4,B4
;*
;* SINGLE SCHEDULED ITERATION
;*
;* C54:
;* 0 LDW .D2T2 *+B5(4),B6 ; |25|
;* || LDW .D1T1 *+A4(4),A3 ; |25|
;* 1 LDW .D2T2 *B5--(8),B7 ; |25|
;* || LDW .D1T1 *A4--(8),A6 ; |25|
;* 2 NOP 3
;* 5 MV .S1X B6,A8 ; |25| Define a twin register
;* || MPYHL .M2X B6,A3,B9 ; |26|
;* 6 MPYHL .M1X B7,A6,A5 ; |26|
;* 7 MPY .M1 A8,A3,A0 ; |25|
;* || MPYHL .M2X A3,B6,B6 ; |26|
;* 8 MPYH .M1 A8,A3,A5 ; |25|
;* || MPYH .M2X B7,A6,B2 ; |25|
;* || MV .S1 A5,A9 ; |26| Split a long life
;* || [ B0] SUB .S2 B0,1,B0 ; |27|
;* 9 SUB .D2 B9,B6,B1 ; |26|
;* || MPY .M1X B7,A6,A0 ; |25|
;* || MPYHL .M2X A6,B7,B8 ; |26|
;* || [ B0] B .S2 C54 ; |27|
;* 10 ADD .L1 A5,A0,A5 ; |25|
;* 11 MV .S2X A5,B3 ; |25| Define a twin register
;* 12 STH .D2T2 B3,*B4--(8) ; |25|
;* || ADD .S1X B2,A0,A0 ; |25|
;* 13 STH .D2T2 B1,*+B4(10) ; |26|
;* || STH .D1T1 A0,*A7--(8) ; |25|
;* || SUB .L1X A9,B8,A3 ; |26|
;* 14 STH .D1T1 A3,*+A7(10) ; |26|
;* ; BRANCH OCCURS ; |27|
;*----------------------------------------------------------------------------*
L4: ; PIPED LOOP PROLOG

LDW .D1T1 *A4--(8),A6 ; |25| (P) <0,1>
|| LDW .D2T2 *B5--(8),B7 ; |25| (P) <0,1>

NOP 1
MV .S1X B3,A2 ; |14|

MVK .S1 0x1,A1 ; init prolog collapse predicate
|| SUB .D2 B4,2,B0
|| ADD .L2 4,B7,B4
|| MV .L1X B7,A7
|| B .S2 L5 ; |27| (P) <0,9>

;** --------------------------------------------------------------------------*
L5: ; PIPED LOOP KERNEL

ADD .L1 A5,A0,A5 ; |25| <0,10>
|| MPYHL .M2X B6,A3,B9 ; |26| <1,5>
|| MV .S1X B6,A8 ; |25| <1,5> Define a twin register
|| LDW .D1T1 *+A4(4),A3 ; |25| <2,0>
|| LDW .D2T2 *+B5(4),B6 ; |25| <2,0>

MV .S2X A5,B3 ; |25| <0,11> Define a twin register
|| MPYHL .M1X B7,A6,A5 ; |26| <1,6>
|| LDW .D1T1 *A4--(8),A6 ; |25| <2,1>
|| LDW .D2T2 *B5--(8),B7 ; |25| <2,1>

ADD .S1X B2,A0,A0 ; |25| <0,12>
|| [!A1] STH .D2T2 B3,*B4--(8) ; |25| <0,12>
|| MPYHL .M2X A3,B6,B6 ; |26| <1,7>
|| MPY .M1 A8,A3,A0 ; |25| <1,7>

SUB .L1X A9,B8,A3 ; |26| <0,13>
|| [!A1] STH .D1T1 A0,*A7--(8) ; |25| <0,13>
|| [!A1] STH .D2T2 B1,*+B4(10) ; |26| <0,13>
|| MV .S1 A5,A9 ; |26| <1,8> Split a long life
|| [ B0] SUB .S2 B0,1,B0 ; |27| <1,8>
|| MPYH .M1 A8,A3,A5 ; |25| <1,8>
|| MPYH .M2X B7,A6,B2 ; |25| <1,8>

[ A1] SUB .S1 A1,1,A1 ; <0,14>
|| [!A1] STH .D1T1 A3,*+A7(10) ; |26| <0,14>
|| MPYHL .M2X A6,B7,B8 ; |26| <1,9>
|| SUB .D2 B9,B6,B1 ; |26| <1,9>
|| [ B0] B .S2 L5 ; |27| <1,9>
|| MPY .M1X B7,A6,A0 ; |25| <1,9>

;** --------------------------------------------------------------------------*
L6: ; PIPED LOOP EPILOG

ADD .D1 A5,A0,A4 ; |25| (E) <1,10>
|| MV .S1X B6,A8 ; |25| (E) <2,5> Define a twin register
|| MPYHL .M2X B6,A3,B9 ; |26| (E) <2,5>

MV .S2X A4,B5 ; |25| (E) <1,11> Define a twin register
|| MPYHL .M1X B7,A6,A0 ; |26| (E) <2,6>

ADD .S1X B2,A0,A4 ; |25| (E) <1,12>
|| MPYHL .M2X A3,B6,B6 ; |26| (E) <2,7>
|| MPY .M1 A8,A3,A0 ; |25| (E) <2,7>
|| STH .D2T2 B5,*B4--(8) ; |25| (E) <1,12>

SUB .L1X A9,B8,A3 ; |26| (E) <1,13>
|| STH .D2T2 B1,*+B4(10) ; |26| (E) <1,13>
|| MV .S1 A0,A9 ; |26| (E) <2,8> Split a long life
|| MPYH .M1 A8,A3,A3 ; |25| (E) <2,8>
|| MPYH .M2X B7,A6,B2 ; |25| (E) <2,8>
|| STH .D1T1 A4,*A7--(8) ; |25| (E) <1,13>

STH .D1T1 A3,*+A7(10) ; |26| (E) <1,14>
|| SUB .D2 B9,B6,B1 ; |26| (E) <2,9>
|| MPY .M1X B7,A6,A0 ; |25| (E) <2,9>
|| MPYHL .M2X A6,B7,B8 ; |26| (E) <2,9>

MVC .S2 B10,CSR ; interrupts on
|| ADD .D1 A3,A0,A3 ; |25| (E) <2,10>

ADD .S1X B2,A0,A0 ; |25| (E) <2,12>
|| MV .S2X A3,B5 ; |25| (E) <2,11> Define a twin register

SUB .L1X A9,B8,A3 ; |26| (E) <2,13>
|| STH .D2T2 B5,*B4--(8) ; |25| (E) <2,12>

STH .D2T2 B1,*+B4(10) ; |26| (E) <2,13>
|| STH .D1T1 A0,*A7--(8) ; |25| (E) <2,13>

RET .S2X A2 ; |28|
|| STH .D1T1 A3,*+A7(10) ; |26| (E) <2,14>

LDW .D2T2 *++SP(8),B10 ; |28|
NOP 4
; BRANCH OCCURS ; |28|
.sect ".text"
.global _foo_c

;******************************************************************************
;* FUNCTION NAME: _foo_c *
;* *
;* Regs Modified : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,B0,B1,B2, *
;* B3,B4,B5,B6,B7,B8,B9,B10,B13,SP *
;* Regs Used : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,B0,B1,B2, *
;* B3,B4,B5,B6,B7,B8,B9,B10,B13,DP,SP *
;* Local Frame Size : 0 Args + 0 Auto + 20 Save = 20 byte *
;******************************************************************************
_foo_c:
;** --------------------------------------------------------------------------*
CMPLT .L2 B4,0,B0 ; |47|
[!B0] B .S1 L10 ; |47|
MV .S1X SP,A9 ; |33|
STW .D2T2 B10,*SP--(24) ; |33|
STW .D1T1 A11,*-A9(12)

STW .D1T1 A12,*-A9(8)
|| STW .D2T2 B13,*+SP(20)
|| SHR .S2 B4,1,B1

STW .D2T1 A10,*+SP(8)
|| MV .S1X B6,A11 ; |33|
|| MV .S2 B3,B13
|| ZERO .L2 B4 ; |47|
|| MV .D1 A4,A3 ; |33|

; BRANCH OCCURS ; |47|
;** --------------------------------------------------------------------------*
MV .S2X A6,B8

MV .S1 A3,A0
|| LDW .D1T2 *A3++(8),B7 ; |70| (P) <0,0> ^
|| LDW .D2T1 *B8++(8),A9 ; |70| (P) <0,0> ^

NOP 1

MV .L2X A6,B5
|| MVC .S2 CSR,B10

AND .S2 -2,B10,B5
|| LDW .D1T1 *+A0(4),A10 ; |52| (P) <0,3>
|| LDW .D2T2 *+B5(4),B9 ; |53| (P) <0,3>

ADD .D2 2,B4,B4 ; |74| (P) <0,4>
|| MVC .S2 B5,CSR ; interrupts off

ZERO .D2 B5
|| ZERO .D1 A8
|| MVK .S1 0x1,A1
|| MV .L1 A9,A0 ; |70| (P) <0,5> Split a long life
|| MPYHL .M2X A9,B7,B2 ; |70| (P) <0,5>
|| CMPGT .L2 B4,B1,B0 ; |74| (P) <0,5>

MV .D1 A1,A2
|| [ A1] MV .S1X B7,A8 ; |70| (P) <0,6> ^ Define a twin register
|| [!B0] ZERO .L1 A1 ; (P) <0,6> ^
|| [ A1] MV .S2X A9,B5 ; |70| (P) <0,6> ^ Define a twin register

MV .S1 A3,A7 ; |70| (P) <0,3> Split a long life
|| MV .S2 B8,B6 ; |70| (P) <0,3> Split a long life
|| MPY .M1 A8,A0,A4 ; |70| (P) <0,7>
|| [ A1] LDW .D1T2 *A3++(8),B7 ; |70| (P) <1,0> ^
|| [ A1] LDW .D2T1 *B8++(8),A9 ; |70| (P) <1,0> ^
|| MPYHL .M2 B7,B5,B0 ; |70| (P) <0,7>

;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop source line : 47
;* Loop opening brace source line : 48
;* Loop closing brace source line : 74
;* Known Minimum Trip Count : 1
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 6
;* Unpartitioned Resource Bound : 5
;* Partitioned Resource Bound(*) : 6
;* Resource Partition:
;* A-side B-side
;* .L units 1 1
;* .S units 2 3
;* .D units 4 2
;* .M units 4 4
;* .X cross paths 6* 5
;* .T address paths 4 2
;* Long read paths 2 0
;* Long write paths 0 0
;* Logical ops (.LS) 3 2 (.L or .S unit)
;* Addition ops (.LSD) 8 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 3 3
;* Bound(.L .S .D .LS .LSD) 6* 4
;*
;* Searching for software pipeline schedule at ...
;* ii = 6 Did not find schedule
;* ii = 7 Schedule found with 3 iterations in parallel
;*
;* Register Usage Table:
;* +---------------------------------+
;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
;* |0000000000111111|0000000000111111|
;* |0123456789012345|0123456789012345|
;* |----------------+----------------|
;* 0: | *********** |*** ****** |
;* 1: |************ |********** |
;* 2: |****** ****** |*** ****** |
;* 3: |************* |*** ****** |
;* 4: |************* |*** ****** |
;* 5: |************ |** ****** |
;* 6: |************ |*** ****** |
;* +---------------------------------+
;*
;* Done
;*
;* Collapsed epilog stages : 2
;* Prolog not removed
;* Collapsed prolog stages : 0
;*
;* Minimum required memory pad : 0 bytes
;*
;* For further improvement on this loop, try option -mh112
;*
;* Minimum safe trip count : 1
;*----------------------------------------------------------------------------*
;* SETUP CODE
;*
;* MVK 0x1,A1
;* ZERO A8
;* ZERO A9
;* ZERO B9
;* ZERO A10
;* MV A8,B7
;* MV A9,B5
;* MV B8,B6
;* MV A3,A7
;* MV A1,A2
;*
;* SINGLE SCHEDULED ITERATION
;*
;* C114:
;* 0 [ A1] LDW .D2T1 *B8++(8),A9 ; |70| ^
;* || [ A1] LDW .D1T2 *A3++(8),B7 ; |70| ^
;* 1 NOP 2
;* 3 [ A1] LDW .D1T1 *+A7(4),A10 ; |52|
;* || [ A1] LDW .D2T2 *+B6(4),B9 ; |53|
;* || MV .L2 B8,B6 ; |70| Split a long life
;* || MV .S1 A3,A7 ; |70| Split a long life
;* 4 ADD .D2 2,B4,B4 ; |74|
;* 5 MV .D1 A9,A0 ; |70| Split a long life
;* || MPYHL .M2X A9,B7,B0 ; |70|
;* || CMPGT .L2 B4,B1,B0 ; |74|
;* 6 [ A1] MV .S2X A9,B5 ; |70| ^ Define a twin register
;* || [ A1] MV .L1X B7,A8 ; |70| ^ Define a twin register
;* || [!B0] ZERO .D1 A1 ; ^
;* 7 MPYHL .M2 B7,B5,B2 ; |70|
;* || MPY .M1 A8,A0,A4 ; |70|
;* 8 MPYHL .M2X A10,B9,B2 ; |71|
;* || MPYH .M1X A10,B9,A6 ; |71|
;* 9 SUB .D2 B2,B0,B0 ; |70|
;* || MPYH .M1 A8,A0,A0 ; |70|
;* || MPY .M2X A10,B9,B0 ; |71|
;* || MV .S1 A1,A5 ; Split a long life
;* 10 SHL .S2 B0,16,B3 ; |70|
;* || MPYHL .M1X B9,A10,A0 ; |71|
;* 11 MV .S1X B3,A12 ; |70| Define a twin register
;* || ADD .L1 A0,A4,A4 ; |70|
;* || ADD .D1 8,A11,A11 ; |70|
;* || ADD .L2X A6,B0,B0 ; |71|
;* || [ A1] B .S2 C114 ; |74|
;* 12 EXTU .S1 A4,16,16,A6 ; |70|
;* || SUB .L1X B2,A0,A4 ; |71|
;* || EXTU .S2 B0,16,16,B2 ; |71|
;* 13 SHL .S1 A4,16,A4 ; |71|
;* 14 ADD .L1 A6,A12,A4 ; |70|
;* || ADD .S1X B2,A4,A6 ; |71|
;* 15 [ A2] STW .D1T1 A4,*-A11(8) ; |70|
;* 16 [ A2] STW .D1T1 A6,*-A11(4) ; |71|
;* || MV .L1 A5,A2 ; Split a long life
;* ; BRANCH OCCURS ; |74|
;*----------------------------------------------------------------------------*
L7: ; PIPED LOOP PROLOG

MPYHL .M2X A10,B9,B2 ; |71| (P) <0,8>
|| MPYH .M1X A10,B9,A6 ; |71| (P) <0,8>

MV .D1 A1,A5 ; (P) <0,9> Split a long life
|| MPY .M2X A10,B9,B0 ; |71| (P) <0,9>
|| SUB .D2 B0,B2,B0 ; |70| (P) <0,9>
|| MPYH .M1 A8,A0,A0 ; |70| (P) <0,9>

;** --------------------------------------------------------------------------*
L8: ; PIPED LOOP KERNEL

MPYHL .M1X B9,A10,A0 ; |71| <0,10>
|| SHL .S2 B0,16,B3 ; |70| <0,10>
|| MV .L2 B8,B6 ; |70| <1,3> Split a long life
|| MV .S1 A3,A7 ; |70| <1,3> Split a long life
|| [ A1] LDW .D1T1 *+A7(4),A10 ; |52| <1,3>
|| [ A1] LDW .D2T2 *+B6(4),B9 ; |53| <1,3>

ADD .D1 8,A11,A11 ; |70| <0,11>
|| MV .S1X B3,A12 ; |70| <0,11> Define a twin register
|| ADD .L2X A6,B0,B0 ; |71| <0,11>
|| [ A1] B .S2 L8 ; |74| <0,11>
|| ADD .L1 A0,A4,A4 ; |70| <0,11>
|| ADD .D2 2,B4,B4 ; |74| <1,4>

SUB .L1X B2,A0,A4 ; |71| <0,12>
|| EXTU .S1 A4,16,16,A6 ; |70| <0,12>
|| EXTU .S2 B0,16,16,B2 ; |71| <0,12>
|| CMPGT .L2 B4,B1,B0 ; |74| <1,5>
|| MV .D1 A9,A0 ; |70| <1,5> Split a long life
|| MPYHL .M2X A9,B7,B0 ; |70| <1,5>

SHL .S1 A4,16,A4 ; |71| <0,13>
|| [ A1] MV .L1X B7,A8 ; |70| <1,6> ^ Define a twin register
|| [!B0] ZERO .D1 A1 ; <1,6> ^
|| [ A1] MV .S2X A9,B5 ; |70| <1,6> ^ Define a twin register

ADD .L1 A6,A12,A4 ; |70| <0,14>
|| ADD .S1X B2,A4,A6 ; |71| <0,14>
|| MPYHL .M2 B7,B5,B2 ; |70| <1,7>
|| MPY .M1 A8,A0,A4 ; |70| <1,7>
|| [ A1] LDW .D1T2 *A3++(8),B7 ; |70| <2,0> ^
|| [ A1] LDW .D2T1 *B8++(8),A9 ; |70| <2,0> ^

[ A2] STW .D1T1 A4,*-A11(8) ; |70| <0,15>
|| MPYHL .M2X A10,B9,B2 ; |71| <1,8>
|| MPYH .M1X A10,B9,A6 ; |71| <1,8>

MV .L1 A5,A2 ; <0,16> Split a long life
|| [ A2] STW .D1T1 A6,*-A11(4) ; |71| <0,16>
|| MV .S1 A1,A5 ; <1,9> Split a long life
|| MPYH .M1 A8,A0,A0 ; |70| <1,9>
|| MPY .M2X A10,B9,B0 ; |71| <1,9>
|| SUB .D2 B2,B0,B0 ; |70| <1,9>

;** --------------------------------------------------------------------------*
L9: ; PIPED LOOP EPILOG
;** --------------------------------------------------------------------------*
NOP 1
MVC .S2 B10,CSR ; interrupts on
;** --------------------------------------------------------------------------*
L10:

LDW .D2T1 *+SP(8),A10 ; |75|
|| MV .S2 B13,B3 ; |75|
|| MV .S1X SP,A9 ; |75|

RET .S2 B3 ; |75|
|| LDW .D1T1 *+A9(12),A11 ; |75|
|| LDW .D2T2 *+SP(20),B13 ; |75|

LDW .D2T2 *++SP(24),B10 ; |75|
|| LDW .D1T1 *+A9(16),A12 ; |75|

NOP 4
; BRANCH OCCURS ; |75|


;void foo_c(const short *restrict x, int N, const short *restrict y, short *restrict res)
;{
; int k;
; int xword0;
; int yword0;
; int xword1;
; int yword1;
;
; int xt0;
; int yt0;
; int oword0;
; int xt1;
; int yt1;
; int oword1;
;
; for(k = 0; k < (N >> 1); k += 2)
; {
;
; xword0 = _amem4_const(&x[2*k]);
; yword0 = _amem4_const(&y[2*k]);
; xword1 = _amem4_const(&x[2*k + 2]);
; yword1 = _amem4_const(&y[2*k + 2]);
;
; /*---------------------------------------------------------*/
; /* res[2*k] = x[2*k]*y[2*k] + x[2*k+1] * y[2*k+1]; */
; /* res[2*k+1] = x[2*k+1]*y[2*k] - x[2*k] * y[2*k+1]; */
; /*---------------------------------------------------------*/
;
; xt0 = _mpy(xword0, yword0) + _mpyh(xword0, yword0);
; yt0 = _mpyhl(xword0, yword0) - _mpylh(xword0, yword0);
; xt1 = _mpy(xword1, yword1) + _mpyh(xword1, yword1);
; yt1 = _mpyhl(xword1, yword1) - _mpylh(xword1, yword1);
;
; oword0 = (yt0 << 16) + (xt0 & 0xFFFF);
; oword1 = (yt1 << 16) + (xt1 & 0xFFFF);
; _amem4(&res[2*k]) = oword0;
; _amem4(&res[2*k+2]) = oword1;
; }
;}
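
To check results on a host machine, the packed-halfword arithmetic in foo_c can be modelled in portable C. This is only a sketch under stated assumptions: sext16, mpy_ll, mpy_hh, mpy_hl, mpy_lh, cmul_word and pack are hypothetical helper names, not TI names, and the model assumes the little-endian layout in which element 2k sits in the low half of each loaded word. On the C6x, _mpy multiplies the signed low halves, _mpyh the signed high halves, _mpyhl the high half of the first operand by the low half of the second, and _mpylh the low half of the first by the high half of the second.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of v without implementation-defined casts. */
static int32_t sext16(uint32_t v)
{
    v &= 0xFFFFu;
    return (int32_t)(v ^ 0x8000u) - 0x8000;
}

/* Hypothetical host stand-ins for _mpy, _mpyh, _mpyhl and _mpylh. */
static int32_t mpy_ll(uint32_t a, uint32_t b) { return sext16(a) * sext16(b); }
static int32_t mpy_hh(uint32_t a, uint32_t b) { return sext16(a >> 16) * sext16(b >> 16); }
static int32_t mpy_hl(uint32_t a, uint32_t b) { return sext16(a >> 16) * sext16(b); }
static int32_t mpy_lh(uint32_t a, uint32_t b) { return sext16(a) * sext16(b >> 16); }

/* Pack two 16-bit values into one word: lo = element 2k, hi = element 2k+1. */
static uint32_t pack(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* One output word of foo_c. Sums are done in unsigned arithmetic so that
 * wrap-around is well defined on the host. */
static uint32_t cmul_word(uint32_t xw, uint32_t yw)
{
    uint32_t xt = (uint32_t)mpy_ll(xw, yw) + (uint32_t)mpy_hh(xw, yw);
    uint32_t yt = (uint32_t)mpy_hl(xw, yw) - (uint32_t)mpy_lh(xw, yw);
    return (yt << 16) + (xt & 0xFFFFu);
}
```

For example, with x = (3, 4) and y = (5, 6) the word comes out as 0x00020027: 3*5 + 4*6 = 39 in the low half and 4*5 - 3*6 = 2 in the high half.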

.global _foo_sa
foo_sa: .cproc A_x, B_n, A_y, B_res

.reg A_xp, B_xp, A_yp, B_yp
.reg A_rp, B_rp, A_xw0, B_xw1
.reg A_yw0, B_yw1, A_rw0, B_rw1

.reg A_xt00, A_xt01, A_yt00, A_yt01
.reg B_xt10, B_xt11, B_yt10, B_yt11
.reg A_xt0, A_yt0, B_xt1, B_yt1

MV A_x, A_xp
ADD A_x, 4, B_xp

MV A_y, A_yp
ADD A_y, 4, B_yp

MV B_res, A_rp
ADD B_res, 2, B_rp

LOOP:
LDW.D1T1 *A_xp++[2], A_xw0
LDW.D2T2 *B_xp++[2], B_xw1
LDW.D1T1 *A_yp++[2], A_yw0
LDW.D2T2 *B_yp++[2], B_yw1

MPY.1 A_xw0, A_yw0, A_xt00
MPYH.1 A_xw0, A_yw0, A_xt01
MPYHL.1 A_xw0, A_yw0, A_yt00
MPYLH.1 A_xw0, A_yw0, A_yt01

MPY.2 B_xw1, B_yw1, B_xt10
MPYH.2 B_xw1, B_yw1, B_xt11
MPYHL.2 B_xw1, B_yw1, B_yt10
MPYLH.2 B_xw1, B_yw1, B_yt11

ADD.1 A_xt00, A_xt01, A_xt0
ADD.1 A_yt00, A_yt01, A_yt0
ADD.2 B_xt10, B_xt11, B_xt1
ADD.2 B_yt10, B_yt11, B_yt1

STH.D1T1 A_xt0, *A_rp++[2]
STH.D2T1 A_yt0, *B_rp++[2]
STH.D1T2 B_xt1, *A_rp++[2]
STH.D2T2 B_yt1, *B_rp++[2]

[B_n] SUB.2 B_n, 1, B_n
[B_n] B.2 LOOP

.return
.endproc
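
One difference from the C version is worth spelling out: the linear assembly writes each 16-bit result with its own STH, while foo_c packs two results with a shift and mask and stores one word. On a little-endian target the two leave the same bytes in memory. The sketch below models that with explicit little-endian byte stores, so it does not depend on the host's endianness; store16_le, store32_le and same_image are hypothetical helper names used only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Store the low 16 bits of v at p, least significant byte first. */
static void store16_le(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xFFu);
    p[1] = (uint8_t)((v >> 8) & 0xFFu);
}

/* Store all 32 bits of v at p, least significant byte first. */
static void store32_le(uint8_t *p, uint32_t v)
{
    store16_le(p, v);
    store16_le(p + 2, v >> 16);
}

/* Returns 1 when "pack then store a word" and "two halfword stores"
 * produce the same four bytes. */
static int same_image(uint32_t xt, uint32_t yt)
{
    uint8_t word_path[4];
    uint8_t half_path[4];

    /* C version: oword = (yt << 16) + (xt & 0xFFFF), then one word store. */
    store32_le(word_path, (yt << 16) + (xt & 0xFFFFu));

    /* Linear assembly version: STH of xt, then STH of yt two bytes later. */
    store16_le(half_path, xt);
    store16_le(half_path + 2, yt);

    return memcmp(word_path, half_path, 4) == 0;
}
```

The halfword route saves the shift, mask and add, at the cost of twice as many stores on the .D units.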


;******************************************************************************
;* TMS320C6x C/C++ Codegen Unix Version 4.31 *
;* Date/Time created: Fri Nov 22 10:14:29 2002 *
;******************************************************************************
;******************************************************************************
;* GLOBAL FILE PARAMETERS *
;* *
;* Architecture : TMS320C620x *
;* Optimization : Enabled at level 2 *
;* Optimizing for : Speed *
;* Based on options: -o2, no -ms *
;* Endian : Little *
;* Interrupt Thrshld : Disabled *
;* Memory Model : Small *
;* Calls to RTS : Near *
;* Pipelining : Enabled *
;* Speculative Load : Disabled *
;* Memory Aliases : Presume not aliases (optimistic) *
;* Debug Info : No Debug Info *
;* *
;******************************************************************************
.asg A15, FP
.asg B14, DP
.asg B15, SP
.global $bss

;void foo_c(const short *restrict x, int N, const short *restrict y, short *restrict res)
;{
; int k;
; int xword0;
; int yword0;
; int xword1;
; int yword1;
;
; int xt0;
; int yt0;
; int oword0;
; int xt1;
; int yt1;
; int oword1;
;
; for(k = 0; k < (N >> 1); k += 2)
; {
;
; xword0 = _amem4_const(&x[2*k]);
; yword0 = _amem4_const(&y[2*k]);
; xword1 = _amem4_const(&x[2*k + 2]);
; yword1 = _amem4_const(&y[2*k + 2]);
;
; /*---------------------------------------------------------*/
; /* res[2*k] = x[2*k]*y[2*k] + x[2*k+1] * y[2*k+1]; */
; /* res[2*k+1] = x[2*k+1]*y[2*k] - x[2*k] * y[2*k+1]; */
; /*---------------------------------------------------------*/
;
; xt0 = _mpy(xword0, yword0) + _mpyh(xword0, yword0);
; yt0 = _mpyhl(xword0, yword0) - _mpylh(xword0, yword0);
; xt1 = _mpy(xword1, yword1) + _mpyh(xword1, yword1);
; yt1 = _mpyhl(xword1, yword1) - _mpylh(xword1, yword1);
;
; oword0 = (yt0 << 16) + (xt0 & 0xFFFF);
; oword1 = (yt1 << 16) + (xt1 & 0xFFFF);
; _amem4(&res[2*k]) = oword0;
; _amem4(&res[2*k+2]) = oword1;
; }
;}

.global _foo_sa

.sect ".text"

;******************************************************************************
;* FUNCTION NAME: foo_sa *
;* *
;* Regs Modified : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A14,B0,B1, *
;* B2,B3,B4,B5,B6,B7,B8,B9,B10,B11,B12,B13,SP *
;* Regs Used : A0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A14,B0,B1, *
;* B2,B3,B4,B5,B6,B7,B8,B9,B10,B11,B12,B13,DP,SP *
;******************************************************************************
foo_sa:
;** --------------------------------------------------------------------------*
; foo_sa: .cproc A_x, B_n, A_y, B_res
; .reg A_xp, B_xp, A_yp, B_yp
; .reg A_rp, B_rp, A_xw0, B_xw1
; .reg A_yw0, B_yw1, A_rw0, B_rw1
; .reg A_xt00, A_xt01, A_yt00, A_yt01
; .reg B_xt10, B_xt11, B_yt10, B_yt11
; .reg A_xt0, A_yt0, B_xt1, B_yt1

MV .S1X SP,A0 ; |42|
|| STW .D2T2 B13,*SP--(32) ; |42|
|| MVC .S2 CSR,B8

MV .L2 B4,B0 ; |42|
|| STW .D1T1 A12,*-A0(20)
|| STW .D2T2 B12,*+SP(28)
|| AND .S2 -2,B8,B7

MV .S1X B6,A8 ; |42|
|| ADD .L2X 0x4,A4,B5 ; |54|
|| STW .D1T1 A10,*-A0(28)
|| MV .L1 A6,A10
|| STW .D2T2 B11,*+SP(24)
|| MVC .S2 B7,CSR ; interrupts off

;*----------------------------------------------------------------------------*
;* SOFTWARE PIPELINE INFORMATION
;*
;* Loop label : LOOP
;* Loop source line : 63
;* Loop closing brace source line : 89
;* Known Minimum Trip Count : 1
;* Known Max Trip Count Factor : 1
;* Loop Carried Dependency Bound(^) : 0
;* Unpartitioned Resource Bound : 4
;* Partitioned Resource Bound(*) : 4
;* Resource Partition:
;* A-side B-side
;* .L units 0 0
;* .S units 0 1
;* .D units 4* 4*
;* .M units 4* 4*
;* .X cross paths 0 0
;* .T address paths 4* 4*
;* Long read paths 2 2
;* Long write paths 0 0
;* Logical ops (.LS) 0 0 (.L or .S unit)
;* Addition ops (.LSD) 2 3 (.L or .S or .D unit)
;* Bound(.L .S .LS) 0 1
;* Bound(.L .S .D .LS .LSD) 2 3
;*
;* Searching for software pipeline schedule at ...
;* ii = 4 Register is live too long
;* |88| -> |89|
;* |80| -> |85|
;* |63| -> |69|
;* |66| -> |74|
;* |64| -> |74|
;* |65| -> |69|
;* |68| -> |78|
;* |81| -> |86|
;* ii = 4 Schedule found with 4 iterations in parallel
;*
;* Register Usage Table:
;* +---------------------------------+
;* |AAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBB|
;* |0000000000111111|0000000000111111|
;* |0123456789012345|0123456789012345|
;* |----------------+----------------|
;* 0: | *** ******* |*** ********* *|
;* 1: |************ |******** **** *|
;* 2: |***** ****** |************** *|
;* 3: |*** ******* |************* *|
;* +---------------------------------+
;*
;* Done
;*
;* Epilog not entirely removed
;* Collapsed epilog stages : 2
;*
;* Prolog not entirely removed
;* Collapsed prolog stages : 2
;*
;* Minimum required memory pad : 0 bytes
;*
;* For further improvement on this loop, try option -mh24
;*
;* Minimum safe trip count : 1
;*----------------------------------------------------------------------------*
;* SINGLE SCHEDULED ITERATION
;*
;* LOOP:
;* 0 LDW .D2T2 *B5++(8),B11 ; |64|
;* || LDW .D1T1 *A10++(8),A4 ; |65|
;* 1 LDW .D1T1 *A9++(8),A6 ; |63|
;* || LDW .D2T2 *B6++(8),B8 ; |66|
;* 2 NOP 3
;* 5 MV .S1 A4,A11 ; |65| Split a long life
;* 6 MPYH .M1 A6,A11,A3 ; |69|
;* || MPY .M2 B11,B8,B10 ; |73|
;* 7 MPY .M1 A6,A11,A0 ; |68|
;* || MPYH .M2 B11,B8,B3 ; |74|
;* 8 MV .L2 B11,B9 ; |64| Split a long life
;* || MV .S2 B8,B12 ; |66| Split a long life
;* || MPYLH .M1 A6,A11,A7 ; |71|
;* 9 MPYHL .M1 A6,A11,A5 ; |70|
;* || MPYHL .M2 B9,B12,B2 ; |75|
;* || [ B0] ADD .S2 0xffffffff,B0,B0 ; |88|
;* 10 ADD .S1 A0,A3,A0 ; |78|
;* || [ B0] B .S2 LOOP ; |89|
;* 11 ADD .L2 B10,B3,B7 ; |80|
;* || STH .D1T1 A0,*A8++(8) ; |83|
;* 12 MPYLH .M2 B9,B12,B13 ; |76|
;* 13 ADD .L1 A5,A7,A4 ; |79|
;* 14 ADD .L2 B2,B13,B7 ; |81|
;* || STH .D2T1 A4,*B4++(8) ; |84|
;* || STH .D1T2 B7,*-A8(4) ; |85|
;* 15 STH .D2T2 B7,*-B4(4) ; |86|
;* ; BRANCH OCCURS ; |89|
;*----------------------------------------------------------------------------*
L1: ; PIPED LOOP PROLOG

MV .L1 A4,A9
|| SUB .S1X B4,1,A1
|| ADD .L2 0x2,B6,B4 ; |60|
|| ADD .S2X 0x4,A6,B6 ; |57|
|| LDW .D1T1 *A10++(8),A4 ; |65| (P) <0,0>
|| LDW .D2T2 *B5++(8),B11 ; |64| (P) <0,0>

LDW .D1T1 *A9++(8),A6 ; |63| (P) <0,1>
|| LDW .D2T2 *B6++(8),B8 ; |66| (P) <0,1>

STW .D1T1 A14,*-A0(16)
|| MV .S1X B3,A14
|| B .S2 LOOP ; |89| (P) <0,10>

MVK .S2 0x1,B1 ; init prolog collapse predicate
|| MVK .S1 0x2,A2 ; init prolog collapse predicate
|| STW .D1T1 A11,*-A0(24)
|| MV .L1X B8,A12
|| STW .D2T2 B10,*+SP(20)

;** --------------------------------------------------------------------------*
LOOP: ; PIPED LOOP KERNEL

MPYLH .M2 B9,B12,B13 ; |76| <0,12>
|| MV .S2 B8,B12 ; |66| <1,8> Split a long life
|| MV .L2 B11,B9 ; |64| <1,8> Split a long life
|| MPYLH .M1 A6,A11,A7 ; |71| <1,8>
|| [ A1] LDW .D2T2 *B5++(8),B11 ; |64| <3,0>
|| [ A1] LDW .D1T1 *A10++(8),A4 ; |65| <3,0>

ADD .L1 A5,A7,A4 ; |79| <0,13>
|| [ B0] ADD .S2 0xffffffff,B0,B0 ; |88| <1,9>
|| MPYHL .M1 A6,A11,A5 ; |70| <1,9>
|| MPYHL .M2 B9,B12,B2 ; |75| <1,9>
|| MV .S1 A4,A11 ; |65| <2,5> Split a long life
|| [ A1] LDW .D1T1 *A9++(8),A6 ; |63| <3,1>
|| [ A1] LDW .D2T2 *B6++(8),B8 ; |66| <3,1>

[!A2] STH .D1T2 B7,*-A8(4) ; |85| <0,14>
|| [!A2] STH .D2T1 A4,*B4++(8) ; |84| <0,14>
|| ADD .L2 B2,B13,B7 ; |81| <0,14>
|| ADD .S1 A0,A3,A0 ; |78| <1,10>
|| [ B0] B .S2 LOOP ; |89| <1,10>
|| MPY .M2 B11,B8,B10 ; |73| <2,6>
|| MPYH .M1 A6,A11,A3 ; |69| <2,6>

[ B1] SUB .S2 B1,1,B1 ; <0,15>
|| [ A2] SUB .S1 A2,1,A2 ; <0,15>
|| [ A1] SUB .L1 A1,1,A1 ; <0,15>
|| [!A2] STH .D2T2 B7,*-B4(4) ; |86| <0,15>
|| ADD .L2 B10,B3,B7 ; |80| <1,11>
|| [!B1] STH .D1T1 A0,*A8++(8) ; |83| <1,11>
|| MPYH .M2 B11,B8,B3 ; |74| <2,7>
|| MPY .M1 A6,A11,A0 ; |68| <2,7>

;** --------------------------------------------------------------------------*
L3: ; PIPED LOOP EPILOG

MV .S1X SP,A9 ; |92|
|| STH .D1T2 B7,*-A8(4) ; |85| (E) <3,14>
|| MPYLH .M2 B9,B12,B5 ; |76| (E) <3,12>

LDW .D2T2 *+SP(28),B12 ; |92|
|| LDW .D1T1 *+A9(8),A11 ; |92|
|| MV .S2X A14,B3 ; |92|
|| ADD .S1 A5,A7,A4 ; |79| (E) <3,13>

ADD .S2 B2,B5,B7 ; |81| (E) <3,14>
|| STH .D2T1 A4,*B4++(8) ; |84| (E) <3,14>

LDW .D1T1 *+A9(12),A12 ; |92|
|| MV .S2X A12,B4
|| STH .D2T2 B7,*-B4(4) ; |86| (E) <3,15>

LDW .D2T2 *+SP(20),B10 ; |92|
|| LDW .D1T1 *+A9(16),A14 ; |92|
|| MVC .S2 B4,CSR ; interrupts on

RET .S2 B3 ; |92|
|| LDW .D2T2 *+SP(24),B11 ; |92|
|| LDW .D1T1 *+A9(4),A10 ; |92|

LDW .D2T2 *++SP(32),B13 ; |92|
NOP 4
; BRANCH OCCURS ; |92|
; .endproc
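
As a rough way to read the software pipeline report in this listing: the resource bound is the largest per-unit operation count divided by the number of units available for it, and a pipelined loop with iteration interval ii and a single-iteration length of L cycles takes about (trips - 1) * ii + L cycles in the loop itself. The sketch below is a back-of-the-envelope estimate only. It ignores function entry and exit, the prolog and epilog collapsing and any memory stalls, and resource_bound and pipelined_cycles are hypothetical helper names.

```c
/* Lower bound on the iteration interval from resource usage alone:
 * max over unit classes of ceil(ops / units). */
static int resource_bound(const int *ops, const int *units, int n)
{
    int bound = 0;
    for (int i = 0; i < n; i++) {
        int b = (ops[i] + units[i] - 1) / units[i];
        if (b > bound)
            bound = b;
    }
    return bound;
}

/* Approximate cycles spent in a software-pipelined loop: the first
 * iteration takes iter_len cycles, every further one completes ii
 * cycles after the previous one. */
static long pipelined_cycles(long trips, int ii, int iter_len)
{
    if (trips <= 0)
        return 0;
    return (trips - 1) * (long)ii + iter_len;
}
```

With the figures reported here (4 operations on a single .D unit and 4 on a single .M unit per side, ii = 4, and a single scheduled iteration spanning cycles 0 to 15, so L = 16), the bound is 4 and 100 trips come to roughly 99 * 4 + 16 = 412 cycles. Each trip produces two complex results.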

